
Decoding the Malata disk

I have a Malata 6619 Karaoke player. It has about 15,000 songs, but only about 500 English songs. My wife insists that the Chinese songs on it are of far higher quality than on my Sonken. I'm not so sure I agree, but anyway it's time to try to decode its file format.

This is only partially successful: while I have got some information, I can't as yet play a MIDI file from the disk.

Resources

Files

The files on the disk are

AUDIO_TS      MALATAF3.BIN  MALATAS3.IDX  MULTAK.DA4    
F6838.IDX     MALATAJ1.BIN  MALATAS4.IDX  MULTAK.DAT    
FONT.BIN      MALATAM1.INF  
FONTFT.BIN    MALATAM2.INF  MULTAK.DA1    VIDEO_TS
Makefile      MALATAM3.INF  MULTAK.DA2    SEARCH.IDX
MALATAF1.BIN  MALATARM.DAT  MULTAK.DA3    
      

Song titles

The command strings shows that song titles are in several files, including MALATAS4.IDX. This seems to contain the most song titles, so I looked at that.

The block of song titles starts at 0x5F000. Before that is a bunch of nulls. To confirm this, the first song in my Malata songbook is "1001次吵架" and I can see the string "1001" as the first entry in that table.

This first song contains Chinese characters in the title, and the site GB Code Table by Mary Ansell confirms that they are encoded using GB2312.

The song titles are just concatenated, for example as "AlrightAmourAndyAngelAre You ReadyAsk For MoreBABY I'M YOUR MANBACK HOME". So there must be a table somewhere showing start and end of each title. I looked for any table giving offsets of the start of songs from 0x5F000. There is such a table, at 0x800! After playing with that for a while, it turns out that this table consists of records of 25 bytes. I'm not sure of the start of the records from 0x800: if I take a starting offset of one, then bytes 19 and 20 hold the offset into the song title table while byte 25 is the length of the song title. But the offset could be higher.

The language appears to be specified in byte 11: 00 for GB2312 (Chinese) titles and 07 for English ones.

The song number is a bit sneaky: bytes 15-17 hold three values which, when written in hexadecimal and concatenated, give the song number as a decimal number. For example, for the song Medley One, the record is

00 00 00 6D 6F 00 00 00 00 04 07 20 00 02 02 01 35 01 7C A8 9E 02 02 00 0A
      

Bytes 15, 16, 17 are "02 01 35" (in hex) and this is the song number 20135 for the Beatles "Medley One".

The earliest English song is 20001 "7Days" and the last one is 20501 "Take Me To Your Heart"

We currently have

11    language
15-17 song number
19,20 offset into song title table
21,22 offset into artist name table?
25    length of title
      

e.g. for Medley One

00 00 00 6D 6F 00 00 00 00 04 07 20 00 02 02 01 35 01 7C A8 9E 02 02 00 0A
                              07          02 01 35 01 7C    9E 02
                              La          SongNumb SongT    Artis 
      

I don't at present know what else is in these records.
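The layout so far can be exercised on the Medley One record. The following is only a sketch of how the known fields might be pulled out of one 25-byte record. The class name MalataRecord and its method names are made up for illustration, the byte positions are the 1-based ones from the table above, and masking the length with 0x7F simply mirrors what SongTitles.java does rather than anything known about the format.

```java
class MalataRecord {
    // The "Medley One" record quoted in the text, one int per byte.
    static final int[] MEDLEY_ONE = {
        0x00, 0x00, 0x00, 0x6D, 0x6F, 0x00, 0x00, 0x00, 0x00, 0x04,
        0x07, 0x20, 0x00, 0x02, 0x02, 0x01, 0x35, 0x01, 0x7C, 0xA8,
        0x9E, 0x02, 0x02, 0x00, 0x0A
    };

    // Fetch a byte using the 1-based numbering of the text.
    static int byteAt(int[] rec, int position) {
        return rec[position - 1];
    }

    static int language(int[] rec) {
        return byteAt(rec, 11);
    }

    // Bytes 15-17: write them in hex, then read the digits as decimal.
    // Assumes the digits are all 0-9, as in the records shown here.
    static int songNumber(int[] rec) {
        String digits = String.format("%X%02X%02X",
                byteAt(rec, 15), byteAt(rec, 16), byteAt(rec, 17));
        return Integer.parseInt(digits);
    }

    // Bytes 19-20, high byte first.
    static int titleOffset(int[] rec) {
        return (byteAt(rec, 19) << 8) | byteAt(rec, 20);
    }

    // Byte 25 with the top bit dropped (an assumption copied from SongTitles.java).
    static int titleLength(int[] rec) {
        return byteAt(rec, 25) & 0x7F;
    }

    public static void main(String[] args) {
        int[] rec = MEDLEY_ONE;
        System.out.printf("language %02X, song %d, title offset %X, title length %d%n",
                language(rec), songNumber(rec), titleOffset(rec), titleLength(rec));
    }
}
```

For this record the length comes out as 10, which matches the ten characters of "Medley One".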

A program to list the titles is SongTitles.java:



import java.io.FileInputStream;
import java.nio.charset.Charset;



class SongTitles {
    //private static int BASE = 0x5F23A;
    private static long OFFSET = 1; // must be >= 1 or blows array bound
    private static long INDEX_BASE = 0x800 + OFFSET;
    private static long TITLE_BASE = 0x5F000;
    private static int NUM = 12000;
    private static int INDEX_SIZE = 25;
    private static int TITLE_OFFSET = (int) (19 - OFFSET);
    private static int LENGTH_TITLE = (int) (25 - OFFSET);
    private static int ARTIST_OFFSET = (int) (21 - OFFSET);
    private static int SONG_INDEX = (int) (15 - OFFSET);

    public static void main(String[] args) throws Exception {
	long nread = INDEX_BASE;
	byte[] titleBytes = new byte[512];
 	FileInputStream fstream = new FileInputStream("MALATAS4.IDX");
	fstream.skip(INDEX_BASE);

	byte[][] indexes = new byte[NUM][];
	long titleStart[] = new long[NUM];
	long titleEnd[] = new long[NUM];
	for (int n = 0; n < NUM; n++) {
	    indexes[n] = new byte[INDEX_SIZE];
	    fstream.read(indexes[n]);
	    // count the bytes just read, even when this record ends the table
	    nread += INDEX_SIZE;
	    if (isNull(indexes[n]))
		break;
	    // printIndex(indexes[n]);

	    // Java bytes are signed, so mask them to get unsigned values
	    int b1 = indexes[n][TITLE_OFFSET] & 0xFF;
	    int b2 = indexes[n][TITLE_OFFSET+1] & 0xFF;

	    titleStart[n] = (b1 << 8) + b2;
	    if (titleStart[n] > 0xfff) {
		// System.out.println("too big");
		titleStart[n] &= 0xfff;
	    } else if (titleStart[n] < 0) {
		System.out.println("too small");
	    }

	    long end = indexes[n][LENGTH_TITLE];
	    if (end >= 0x80) {
		end -= 0x80;
		// System.out.println("End too big");
	    } else if (end < 0) {
		// System.out.println("End negative");
		end += 128;
	    }
	    titleEnd[n] = end;

	    /*
	    System.out.printf("Numbers %X %X %X %X\n", b1, b2, 
			      titleStart[n],
			      titleStart[n]+titleEnd[n]);
	    */
	}

	System.out.printf("Skip to %X\n", (TITLE_BASE - nread));
	fstream.skip(TITLE_BASE - nread);
	nread = TITLE_BASE;

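	for (int n = 0; n < NUM-1; n++) {
	    // stop at the end of the index table (slots never read are still null)
	    if (indexes[n] == null || isNull(indexes[n]))
		break;
	    // int len = (int) (titleStart[n+1]-titleStart[n]);
	    int len = (int) titleEnd[n];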
	    //System.out.println("Reading " + len);
	    fstream.read(titleBytes, 0, len);

	    Charset charset = Charset.forName("gb2312");
	    String translated = new String(titleBytes, 0, len, charset);

	    printFullIndex(indexes[n]);
	    System.out.print(" SongIndex ");
	    printSongIndex(indexes[n]);
	    System.out.print(" ArtistIndex ");
	    printArtistIndex(indexes[n]);
	    System.out.println("" + n + ": " +translated);
	    //printIndex(indexes[n]);
	}

	/*
	fstream.skip(TITLE_BASE);

	byte[] bytes = new byte[NUM];
	fstream.read(bytes);

	Charset charset = Charset.forName("gb2312");
	String translated = new String(bytes, charset);

	System.out.println(translated);
	*/
    }

    private static void printFullIndex(byte[] bytes) {
	for (int n = 0; n < bytes.length; n++) {
	    System.out.printf("%02X ", bytes[n]);
	}
    }

    private static void printArtistIndex(byte[] bytes) {
	System.out.printf("%02X%02X ", 
			  bytes[ARTIST_OFFSET], 
			  bytes[ARTIST_OFFSET+1]);
    }

    private static void printSongIndex(byte[] bytes) {
	System.out.printf("%X%02X%02X ", 
			  bytes[SONG_INDEX], 
			  bytes[SONG_INDEX+1], 
			  bytes[SONG_INDEX+2]);
    }
    
    private static void print1byte(byte[] bytes, int index) {
	System.out.printf("%02X ", bytes[index]);
    }
    
    private static boolean isNull(byte[] bytes) {
	for (int n = 0; n < bytes.length; n++) {
	    if (bytes[n] != 0)
		return false;
	}
	return true;
    }
}
      

One table finishes at 0xF5260, maybe starting another at 0xF5800. Another table starts at 0x88000 and finishes about 0x9ADF0. Another starts at 0x9B000 and finishes at 0x9B2B0. I don't know what is in these tables.

Song data

Most of this section was discovered by thanth. However, he only deals with a single data file, and as the Malata has more, it becomes more complex.

There are five data files: MULTAK.DAT, MULTAK.DA1, MULTAK.DA2, MULTAK.DA3 and MULTAK.DA4. The primary data file is MULTAK.DAT, which contains tables of pointers to song data. The other files seem to contain just the song data.

The number of songs (minus one) is in byte-swapped order at 0x14E in MULTAK.DAT. In my files, this is "FB 3D" which when swapped to "3D FB" is one less than the number of songs, 0x3DFC (15868). This was identified by thanth.
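The byte swap is just a little-endian 16-bit read. As a small worked example (the class and method names are invented here, and the two bytes are passed in directly rather than read from MULTAK.DAT):

```java
class SongCount {
    // lowByte is the byte at 0x14E, highByte the byte at 0x14F.
    static int songCount(int lowByte, int highByte) {
        int stored = (highByte << 8) | lowByte;   // "FB 3D" becomes 0x3DFB
        return stored + 1;                        // the file holds the count minus one
    }

    public static void main(String[] args) {
        System.out.println(songCount(0xFB, 0x3D)); // prints 15868
    }
}
```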

Starting at 0xD20 is a table of 4-byte numbers (prefixed by "FF 00 FF FF"), which are indexes into the song data. If the bytes are "b0 b1 b2 b3" then thanth discovered that the song data starts at

(((b0 * 0x3C) + b1) * 0x4B + b2) * 0x800 + 0x10000
      

Actually, it is more complex than that. The formula above holds for songs whose data is in the first file, MULTAK.DAT. The table also contains pointers to data in the other files, and for these the formula is

(((b0 * 0x3C) + b1) * 0x4B + b2) * 0x800
      

That is, the data in these later files starts immediately with no offset.

The file for each song is given in the top half of the fourth byte of the song index: (b3 >> 4), where zero is MULTAK.DAT, one is MULTAK.DA1, etc.
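Putting the two formulas and the file selector together gives something like the sketch below. It uses the multiplier 0x4B that appears in the SongData.java listing; the constants 0x3C (60) and 0x4B (75) together with 0x800-byte blocks look like CD-style minute/second/frame addressing, though that is only an observation. The class name SongLocation, its method names and the index bytes in main are invented for illustration and are not taken from the disk.

```java
class SongLocation {
    // The top half of the fourth index byte selects the file.
    static int fileNumber(int[] b) {
        return b[3] >> 4;
    }

    // 0 is MULTAK.DAT, 1 is MULTAK.DA1, and so on.
    static String fileName(int fileNumber) {
        return fileNumber == 0 ? "MULTAK.DAT" : "MULTAK.DA" + fileNumber;
    }

    // Byte offset of the song data within its file.
    static long start(int[] b) {
        long blocks = ((b[0] * 0x3CL) + b[1]) * 0x4B + b[2];
        // Only MULTAK.DAT has the extra 0x10000 in front of the song data.
        long base = (fileNumber(b) == 0) ? 0x10000 : 0;
        return blocks * 0x800 + base;
    }

    public static void main(String[] args) {
        // Hypothetical index entries, chosen only to show the arithmetic.
        int[][] samples = {
            {0x00, 0x02, 0x10, 0x00},
            {0x00, 0x02, 0x10, 0x10}
        };
        for (int[] b : samples) {
            System.out.printf("%s at %X%n", fileName(fileNumber(b)), start(b));
        }
    }
}
```

With these made-up bytes the same block number lands at 0x63000 in MULTAK.DAT but at 0x53000 in MULTAK.DA1, which is the only difference between the two formulas.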

At the location given by each song data pointer is either the string "OK", which means a "simple song", or "FF FF", which means a "complex song", according to thanth. Simple songs just contain lyrics and MIDI data, while complex songs also have MP3 data. I haven't yet found any information about the size of the data for each song.

The program SongData.java splits the MULTAK files into individual song data files. It only saves a part of the data for each song, since I don't know where the data finishes.



import java.io.FileInputStream;
import java.io.FileOutputStream;



class SongData {
    private static int MAX_SONGS = 20000;

    private static long OFFSET = 0;
    private static long INDEX_BASE = 0xD20 + OFFSET;

    private static long MULTAK_SIZES[] = {0x2441D800, 
					  943441920,
					  943405056, 
					  943812608,
					  58099712};
    //private long offset = 0x10000;

    private enum SongType {
	SIMPLE, COMPLEX;
    }

    private static long TITLE_BASE = 0x5F000;
    private static int NUM = 12000;
    private static int INDEX_SIZE = 25;
    private static int TITLE_OFFSET = (int) (19 - OFFSET);
    private static int LENGTH_TITLE = (int) (25 - OFFSET);
    private static int ARTIST_OFFSET = (int) (21 - OFFSET);
    private static int SONG_INDEX = (int) (15 - OFFSET);

    private class SongStart {
	public long start;
	public int fileNumber;
	public int songNumber;
	public byte[] indexBytes;
	public byte[] data;
	public SongType type;
    }


    public static void main(String[] args) throws Exception {
	new SongData();
    }

    public SongData() throws Exception {
	long nread = INDEX_BASE;
	long index;
	int[] bytes;
	int[] ibytes = new int[4];
	int numSongs = 0;

	SongStart songStarts[] = new SongStart[MAX_SONGS];

 	FileInputStream fstream = new FileInputStream("MULTAK.DAT");
	fstream.skip(INDEX_BASE);
	bytes = read4(fstream);
	long lval = 0;
	int currentFileNumber = 0;

	while (numSongs < MAX_SONGS) {
	    // System.out.printf("%X\n", index);
	    bytes = read4(fstream);
	    if (isNull(bytes)) {
		System.out.printf("Read %d songs\n", numSongs);
		break;
	    }

	    if (isFF00FFFF(bytes)) {
		// these seem to occur sometimes e.g. at A8B8
		continue;
	    }

	    songStarts[numSongs] = new SongStart();
	    songStarts[numSongs].songNumber = numSongs;
	    songStarts[numSongs].fileNumber = bytes[3] >> 4;
	    if ((bytes[3] & 0xF) == 0xA) {
		songStarts[numSongs].type = SongType.COMPLEX;
	    } else {
		songStarts[numSongs].type = SongType.SIMPLE;
	    }
	    songStarts[numSongs].start = songStart(bytes);

	    /*
	    // update fileNumber number?
	    if (numSongs > 0 && 
		songStarts[numSongs-1].start > songStarts[numSongs].start) {

		offset = 0;
		// may need to reset this since offset may have changed
		songStarts[numSongs].start = songStart(bytes);
		currentFileNumber++;
	    }
	    */

	    // songStarts[numSongs].fileNumber = currentFileNumber;
	    System.out.printf("Song %d starts %X fileNumber %X (", 
			      numSongs, 
			      songStarts[numSongs].start,
			      songStarts[numSongs].fileNumber);
	    printBytes(bytes);
	    System.out.println(')');

	    numSongs++;
	}
	fstream.close();

	for (int n = 0; n < numSongs; n++) {
	    System.out.printf("Number %d start %X bytes ",
			      n, songStarts[n].start);
	    getSongFromStart(songStarts[n]);
	    System.out.println();

	    saveSong(songStarts[n]);
	}
	    
	/*
	fstream =  new FileInputStream("MULTAK.DAT");
	long totalRead = 0;
	for (int n = 0; n < MAX_SONGS; n++) {
	    fstream.skip(songStarts[n] - totalRead);
	    totalRead = songStarts[n];
	    System.out.printf("Skipped to %X %X\n", songStarts[n], totalRead);
	    // check next song
	    bytes = read4(fstream);
	    totalRead += 4;

	    for (n = 0; n < 4; n++) {
		System.out.printf("   %X\n", bytes[n]);
		ibytes[n] = (bytes[n] >= 0 ? bytes[n] : 256 + bytes[n]);
	    }

	    lval = 0;
	    for (n = 0; n < 4; n++) {
		lval = (lval << 8) + bytes[n];
	    }
	    System.out.printf("  next bytes %X\n", lval);
	}  
	*/

    }

    private int[] read4(FileInputStream f) throws Exception {
	byte[] bytes = new byte[4];
	int[] ibytes = new int[4];
	long ret = 0;

	f.read(bytes);
	
	// ensure they are unsigned bytes
	for (int n = 0; n < 4; n++) {
	    ibytes[n] = (bytes[n] >= 0 ? bytes[n] : 256 + bytes[n]);
	}
	return ibytes;
    }

    private long songStart(int[] indexBytes) {
	long offset;

	int fileNumber = indexBytes[3] >> 4;
	if (fileNumber == 0)
	    offset = 0x10000;
	else
	    offset = 0;

	long idx =  ((indexBytes[0] * 0x3C) + 
		     indexBytes[1]) * 0x4B + indexBytes[2];
	return idx * 0x800 + offset;
    }

    private void getSongFromStart(SongStart songInfo) throws Exception {
	String fname;
	
	if (songInfo.fileNumber == 0) {
	    fname = "MULTAK.DAT";
	} else {
	    fname = "MULTAK.DA" + songInfo.fileNumber;
	}

 	FileInputStream fstream = new FileInputStream(fname);
	fstream.skip(songInfo.start);
	int[] bytes = read4(fstream);
	if (bytes[0] == 0xFF && bytes[1] == 0xFF) {
	    songInfo.type = SongType.COMPLEX;
	} else {
	    songInfo.type = SongType.SIMPLE;
	}
	songInfo.data = new byte[0x4B00];
	fstream.read(songInfo.data);
	fstream.close();
    }

    private void saveSong(SongStart songInfo) throws Exception {
	String fname = "songs/" + songInfo.songNumber;
 	FileOutputStream fstream = new FileOutputStream(fname);
	if (songInfo.type == SongType.SIMPLE)
	    fstream.write(new byte[] {0, 0, 'O', 'K'});
	else
	    fstream.write(new byte[] {(byte)0xFF, (byte)0xFF, 0, 0});

	fstream.write(songInfo.data);
	fstream.close();
    }

    private boolean greaterThanEqual(int[] x, int[] y) {
	for (int n = 0; n < x.length; n++) {
	    if (x[n] < y[n]) {
		return false;
	    }
	}
	return true;
    }

    private void printFullIndex(byte[] bytes) {
	for (int n = 0; n < bytes.length; n++) {
	    System.out.printf("%02X ", bytes[n]);
	}
    }

    private void printArtistIndex(byte[] bytes) {
	System.out.printf("%02X%02X ", 
			  bytes[ARTIST_OFFSET], 
			  bytes[ARTIST_OFFSET+1]);
    }

    private void printSongIndex(byte[] bytes) {
	System.out.printf("%X%02X%02X ", 
			  bytes[SONG_INDEX], 
			  bytes[SONG_INDEX+1], 
			  bytes[SONG_INDEX+2]);
    }
    
    private void print1byte(byte[] bytes, int index) {
	System.out.printf("%02X ", bytes[index]);
    }

    private void printBytes(int[] bytes) {
	for (int n = 0; n < bytes.length; n++) {
	    System.out.printf("%02X ", bytes[n]);
	}
    }
    
    private boolean isNull(int[] bytes) {
	for (int n = 0; n < bytes.length; n++) {
	    if (bytes[n] != 0)
		return false;
	}
	return true;
    }

    private boolean isFF00FFFF(int[] bytes) {
	if (bytes[0] == 0xFF &&
	    bytes[2] == 0xFF &&
	    bytes[3] == 0xFF)
	    return true;
	else
	    return false;
    }
}
      

Linking songs in the song book to song data

The file MALATAS4.IDX, we have determined, holds a table of 25-byte records starting at 0x800. The current knowledge of these records is

11    language (00 is GB2312, 07 is English)
15-17 song number in song book (read the hex number as decimal)
19-20 offset of song title into table at 0x5F000
25    length of title

The song data is spread across the files MULTAK.DAT to MULTAK.DA4. At 0xD20 in MULTAK.DAT is a table of 4-byte indexes. Each index translates into a starting location in one of the data files. The songs are of type "simple" (probably MIDI data only) or "complex" (MIDI plus MP3).

The problem now is to link the song information to its data.

Using the starting point of each song, we can extract, say, a few thousand bytes of its data. The correct length is at present unknown. I saved them as files 1, 2, 3, ..., 15460.

With them all in separate files, I tried to see if any of them were recognisable. Completely by fluke, while running strings to look for "Fool on the Hill" (which I knew was on the disk), I hit upon song 10397 by searching for "hill", with bvi showing

00000000  00 00 4F 4B 00 00 00 00 00 00 00 00 00 00 00 00 ..OK............
00000010  00 00 00 00 00 20 00 00 00 07 00 00 00 00 00 00 ..... ..........
00000020  00 00 91 06 00 00 20 11 00 00 77 48 41 54 00 64 ...... ...wHAT.d
00000030  52 45 41 4D 53 00 61 52 45 00 6D 41 44 45 00 6F REAMS.aRE.mADE.o
00000040  46 0F 6F 72 69 67 69 6E 61 6C 1A 0F 68 49 4C 4C F.original..hILL
00000050  41 52 59 00 64 55 46 46 20 20 20 52 20 25 C8 2E ARY.dUFF   R %..
      

This song isn't in my song book, but it is in the list I pulled out of MALATAS4.IDX:

00 00 00 77 64 61 6D 6F 00 04 07 00 00 05 02 04 42 01 87 CE F1 02 05 80 17  SongIndex 20442  ArtistIndex F102 10415: What Dreams Are Made Of
      

So song title 20442 has its song data in file 10397.

With this clue, doing something like strings -f -n 30 * quickly shows up other files with English text. There are enough of them to draw up a table

ID  Song # in book  Index to data  Song title
S1  20247           10202          NEXT 100 YEARS
S2  20428           10383          Lovers
S3  20442           10397          What dreams are...
S4  20154           10109          Only sleeping

and this is just linear:

data index = song number - 10045
      

NB: this only works for some songs - I guess the English ones! If this pattern holds for all English songs, the earliest data file is (20001-10045) = 9956 and the last one is (20501-10045) = 10456.
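As a quick check of the arithmetic, the relation can be run over the four rows of the table and the two end points. This is a sketch only: the class and method names are invented, and the shift of 10045 is known to hold only for the songs listed here.

```java
class DataIndex {
    // Difference between the song book number and the data file index.
    static final int SHIFT = 10045;

    static int dataIndex(int songNumber) {
        return songNumber - SHIFT;
    }

    public static void main(String[] args) {
        // The four songs from the table, then the first and last English songs.
        int[] songNumbers = {20247, 20428, 20442, 20154, 20001, 20501};
        for (int song : songNumbers) {
            System.out.println(song + " -> " + dataIndex(song));
        }
    }
}
```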

Decoding song data

It would be easy if all songs contained English text in the data. But I only found the above four. So the rest must be encoded in some way.

A couple that I looked at seemed a bit messy. So I settled (for no particular reason) on Don Gibson's "Oh Lonesome Me". Playing the song on the Malata shows the headers

Oh Lonesome Me
ORIGINAL:
Don Gibson
      

while the lyrics are

Every body's goin'
out and havin' fun
I'm just a fool
for stayin' home
and havinnone
      

Running bvi on the data file gives

00000000  00 00 4F 4B 00 00 00 00 00 00 00 00 00 00 00 00 ..OK............
00000010  00 00 00 00 00 20 00 00 00 07 00 00 00 00 00 00 ..... ..........
00000020  00 00 6D 05 00 00 33 0F 00 00 7C 5B 13 7F 5C 5D ..m...3...|[..\]
00000030  56 40 5C 5E 56 13 7E 56 1C 7C 61 7A 74 7A 7D 72 V@\^V.~V.|aztz}r
00000040  7F 09 1C 77 5C 5D 13 74 5A 51 40 5C 5D 1C 33 33 ...w\].tZQ@\].33
00000050  33 1F 33 36 27 37 33 33 32 33 33 33 05 30 30 32 3.36'7332333.002
00000060  33 33 3D 0B 35 3A 32 33 33 7F 17 37 3C 32 33 32 33=.5:233..7<232
      

From the songs with English, I know the song title starts at 0x2A, and we have to do this match:

7C 5B 13 7F 5C 5D 56 40 5C 5E 56 13 7E 56 1C 7C 61 7A 74 7A 7D 72
O  h     L  o  n  e  s  o  m  e     M  e  /  O  R  I  G  I  N  A

7F 09 1C 77 5C 5D 13 74 5A 51 40 5C 5D 1C 33 33
L  :  /  D  o  n     G  i  b  s  o  n  /
      

The obvious thing to try is a substitution cipher: for example, the 'o's are encoded as 0x5c while the 'O's are 0x7c. It's a game of pattern matching, and the answer is the following piece of C code

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    FILE *fp;
    FILE *ofp;

    if (argc == 1) {
	fp = stdin;
	ofp = stdout;
    } else if (argc == 2) {
	fp = fopen(argv[1], "r");
	ofp = stdout;
    } else {
	fp = fopen(argv[1], "r");
	ofp = fopen(argv[2], "w");
    }

    int ch;  // int rather than char, so that EOF can be told apart from data
    int n = 0;
    
    while (n++ < 1500) {
	ch = getc(fp);
	if (ch == EOF)
	    break;

	switch (ch) {
	case 0x13: ch = ' '; break;

	case 0x52: ch = 'a'; break;
	case 0x51: ch = 'b'; break;
	case 0x50: ch = 'c'; break;

	case 0x57: ch = 'd'; break;
	case 0x56: ch = 'e'; break;
	case 0x55: ch = 'f'; break;
	case 0x54: ch = 'g'; break;

	case 0x5B: ch = 'h'; break;
	case 0x5A: ch = 'i'; break;
	case 0x59: ch = 'j'; break;
	case 0x58: ch = 'k'; break;

	case 0x5F: ch = 'l'; break;
	case 0x5E: ch = 'm'; break;
	case 0x5D: ch = 'n'; break;
	case 0x5C: ch = 'o'; break;

	case 0x43: ch = 'p'; break;
	case 0x42: ch = 'q'; break;
	case 0x41: ch = 'r'; break;
	case 0x40: ch = 's'; break;

	case 0x47: ch = 't'; break;
	case 0x46: ch = 'u'; break;
	case 0x45: ch = 'v'; break;
	case 0x44: ch = 'w'; break;

	case 0x4A: ch = 'y'; break;

	    // These aren't organised yet
	    //case 0x: ch = ''; break;

	case 0x72: ch = 'A'; break;
	case 0x71: ch = 'B'; break;
	case 0x70: ch = 'C'; break;

	case 0x77: ch = 'D'; break;
	case 0x76: ch = 'E'; break;
	case 0x75: ch = 'F'; break;
	case 0x74: ch = 'G'; break;

	case 0x7B: ch = 'H'; break;
	case 0x7A: ch = 'I'; break;
	case 0x79: ch = 'J'; break;
	case 0x78: ch = 'K'; break;

	case 0x7F: ch = 'L'; break;
	case 0x7E: ch = 'M'; break;
	case 0x7D: ch = 'N'; break;
	case 0x7C: ch = 'O'; break;

	case 0x63: ch = 'P'; break;
	case 0x62: ch = 'Q'; break;
	case 0x61: ch = 'R'; break;
	case 0x60: ch = 'S'; break;

	case 0x67: ch = 'T'; break;
	case 0x66: ch = 'U'; break;
	case 0x65: ch = 'V'; break;
	case 0x64: ch = 'W'; break;

	case 0x6B: ch = 'X'; break;
	case 0x6A: ch = 'Y'; break;
	case 0x69: ch = 'Z'; break;

	case 0x09: ch = ':'; break;

	case 0x14: ch = '\''; break;

	default: ch = '.'; break;
	}
	putc(ch, ofp);
    }
    fclose(ofp);
    exit(0);
}

      

The substitutions here are grouped in fours, but of course there is no reason why they should be. (The grouping is repeated in some other decodings, but not all.)

Following application of that substitution, the file looks like

00000000  2E 2E 2E 2E 2E 2E 2E 2E 2E 2E 2E 2E 2E 2E 2E 2E ................
00000010  2E 2E 2E 2E 2E 2E 2E 2E 2E 2E 2E 2E 2E 2E 2E 2E ................
00000020  2E 2E 2E 2E 2E 2E 2E 2E 2E 2E 4F 68 20 4C 6F 6E ..........Oh Lon
00000030  65 73 6F 6D 65 20 4D 65 2E 4F 52 49 47 49 4E 41 esome Me.ORIGINA
00000040  4C 3A 2E 2E 6F 6E 20 47 69 62 73 6F 6E 2E 2E 2E L:..on Gibson...
00000050  2E 2E 2E 2E 2E 2E 2E 2E 2E 2E 2E 2E 2E 2E 2E 2E ................
00000060  2E 2E 2E 2E 2E 2E 2E 2E 2E 4C 2E 2E 2E 2E 2E 2E .........L......
00000070  2E 2E 2E 2E 2E 2E 2E 2E 2E 2E 2E 2E 2E 2E 2E 2E ................
00000080  2E 2E 2E 2E 2E 2E 2E 2E 2E 2E 2E 2E 2E 2E 2E 2E ................
00000090  45 2E 76 2E 65 2E 72 2E 79 2E 20 2E 62 2E 6F 2E E.v.e.r.y. .b.o.
000000A0  64 2E 79 2E 27 2E 73 2E 20 2E 67 2E 6F 2E 69 2E d.y.'.s. .g.o.i.
000000B0  6E 2E 27 2E 2E 2E 2E 2E 2E 2E 2E 2E 2E 49 2E 27 n.'..........I.'
000000C0  2E 6D 2E 20 2E 6A 2E 75 2E 73 2E 74 2E 20 2E 61 .m. .j.u.s.t. .a
000000D0  2E 20 2E 66 2E 6F 2E 6F 2E 6C 2E 2E 2E 2E 2E 2E . .f.o.o.l......
000000E0  2E 2E 2E 61 2E 6E 2E 64 2E 20 2E 68 2E 61 2E 76 ...a.n.d. .h.a.v
000000F0  2E 2E 2E 69 2E 6E 2E 6E 2E 6F 2E 6E 2E 65 2E 2E ...i.n.n.o.n.e..
00000100  2E 2E 2E 2E 2E 2E 2E 2E 68 2E 6F 2E 77 2E 20 2E ........h.o.w. .
00000110  2E 2E 73 2E 68 2E 65 2E 20 2E 73 2E 65 2E 74 2E ..s.h.e. .s.e.t.
00000120  20 2E 2E 2E 6D 2E 65 2E 20 2E 66 2E 72 2E 65 2E  ...m.e. .f.r.e.
00000130  65 2E 2E 2E 2E 2E 2E 2E 2E 2E 2E 54 2E 68 2E 61 e..........T.h.a
00000140  2E 74 2E 20 2E 6D 2E 69 2E 73 2E 74 2E 61 2E 6B .t. .m.i.s.t.a.k
      

This is much more readable! It doesn't quite follow the lyrics though, which is an issue for later.

The next occurrence of this substitution cipher is at file 10281, song number 20326 "Give me all night", so the files in between must use other substitutions. It then turns up again at file 10326, song number 20371 "Stand By Your Man".

So then I tried another song, California dreaming, song 20088, file 10043. That was pretty straightforward. Other songs using this substitution are 20033 Heartbreaker, file 9988, 20082 Another Girl, file 10037, 20213 Don't talk, file 10168, 20382 Things we Said Today, file 10336. I also tried some others: no pattern as to song numbers.

However, the label for the substitution pattern appears to be byte 0x26. Here we have another coincidence: the byte which is mapped to the space character ' ' is 0x20 less than byte 0x26. For example, the substitution for songs like file 10337 is

        switch (ch) {
        case 0x95: ch = ' '; break;

        case 0xD4: ch = 'a'; break;
        case 0xD7: ch = 'b'; break;
        case 0xD6: ch = 'c'; break;

        case 0xD1: ch = 'd'; break;
        case 0xD0: ch = 'e'; break;
        case 0xD3: ch = 'f'; break;
        case 0xD2: ch = 'g'; break;

        case 0xDD: ch = 'h'; break;
        case 0xDC: ch = 'i'; break;
        case 0xDF: ch = 'j'; break;
        case 0xDE: ch = 'k'; break;

        case 0xD9: ch = 'l'; break;
        case 0xD8: ch = 'm'; break;
        case 0xDB: ch = 'n'; break;
        ...
      

and byte 0x26 for that file is 0xB5, and 0x20 = 0xB5 - 0x95. Maybe the pattern is just based on these bytes (e.g. what is byte 0x27 - an index into pattern types?).
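One hypothesis that fits every mapping listed above (it has not been checked against the rest of the disk) is that the "substitution" is simply an exclusive-or of each byte with byte 0x26 of the file. For "Oh Lonesome Me" byte 0x26 is 0x33 and 'O' (0x4F) xor 0x33 is 0x7C; for the file 10337 table 'a' (0x61) xor 0xB5 is 0xD4; for "What Dreams Are Made Of" byte 0x26 is 0x20, which would explain the swapped upper and lower case; and for "Lovers" it is 0x00, which leaves the text readable. The sketch below applies that guess to bytes copied from the dumps. The class name XorGuess and its method are illustrative only.

```java
class XorGuess {
    // "Oh Lonesome Me" from offset 0x2A; byte 0x26 of that file is 0x33.
    static final int[] LONESOME = {
        0x7C, 0x5B, 0x13, 0x7F, 0x5C, 0x5D, 0x56, 0x40, 0x5C, 0x5E, 0x56, 0x13,
        0x7E, 0x56, 0x1C, 0x7C, 0x61, 0x7A, 0x74, 0x7A, 0x7D, 0x72, 0x7F, 0x09,
        0x1C, 0x77, 0x5C, 0x5D, 0x13, 0x74, 0x5A, 0x51, 0x40, 0x5C, 0x5D, 0x1C
    };

    // Start of "What Dreams Are Made Of" from offset 0x2A; byte 0x26 is 0x20.
    static final int[] DREAMS = {
        0x77, 0x48, 0x41, 0x54, 0x00, 0x64, 0x52, 0x45, 0x41, 0x4D, 0x53
    };

    // XOR every byte with the key and treat the result as an ASCII character.
    static String decode(int[] data, int key) {
        StringBuilder sb = new StringBuilder();
        for (int b : data) {
            sb.append((char) ((b ^ key) & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode(LONESOME, 0x33));
        System.out.println(decode(DREAMS, 0x20));
    }
}
```

If the guess holds, one routine would replace the hand-built switch tables, but it only covers single-byte text and says nothing about how the Chinese lyrics are stored.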

To be followed up...

There is one critical issue in this: it isn't only the English alphabetic characters that are encoded: others are too. But I have no clues as to what the other 65000+ Unicode characters should be!

MIDI from data file

Here is where I am currently stuck. The data files are not MIDI files. For example, song 10383 Lovers by Abba is

00000000  00 00 4F 4B 00 00 00 00 00 00 00 00 00 00 00 00 ..OK............
00000010  00 00 00 00 00 24 00 00 00 07 00 00 00 00 00 00 .....$..........
00000020  00 00 77 07 00 00 00 12 00 00 4C 6F 76 65 72 73 ..w.......Lovers
00000030  2F 4F 52 49 47 49 4E 41 4C 3A 2F 41 62 62 61 2F /ORIGINAL:/Abba/
00000040  28 4C 69 76 65 20 41 20 4C 69 74 74 6C 65 20 4C (Live A Little L
00000050  6F 6E 67 65 72 29 2F 00 00 00 56 00 06 EB 0A 00 onger)/...V.....
00000060  00 01 00 00 00 36 05 01 01 00 00 17 2A 06 02 01 .....6......*...
00000070  00 00 38 2A 04 03 01 00 00 46 36 05 04 01 00 00 ..8*.....F6.....
00000080  88 39 05 05 01 00 00 A8 36 05 06 01 00 00 F1 2A .9......6......*
00000090  04 07 01 00 01 16 53 02 08 01 00 01 24 42 04 09 ......S.....$B..
000000A0  01 00 01 32 23 06 11 CA 00 0D 5B 00 0D 62 81 80 ...2#.....[..b..
000000B0  05 82 9D 09 30 07 18 07 18 07 18 07 18 07 00 26 ....0..........&
000000C0  01 5C 00 53 04 69 04 74 06 20 00 64 07 6F 05 77 .\.S.i.t. .d.o.w
000000D0  05 6E 07 20 00 61 09 6E 07 64 09 20 00 6C 0B 69 .n. .a.n.d. .l.i
000000E0  09 73 08 74 08 65 09 6E 08 5E 01 00 1D 01 1E 26 .s.t.e.n.^.....&
      

This isn't a MIDI file. From my Sonken, I have a MIDI file for this song (obviously not the same recording) which looks like

00000000  4D 54 68 64 00 00 00 06 00 01 00 03 00 1E 4D 54 MThd..........MT
00000010  72 6B 00 00 00 2B 00 FF 03 0C 53 6F 66 74 20 4B rk...+....Soft K
00000020  61 72 61 6F 6B 65 00 FF 01 13 40 4B 4D 49 44 49 araoke....@KMIDI
00000030  20 4B 41 52 41 4F 4B 45 20 46 49 4C 45 00 FF 2F  KARAOKE FILE../
00000040  00 4D 54 72 6B 00 00 1A 5E 00 FF 01 05 40 4C 45 .MTrk...^....@LE
00000050  4E 47 00 FF 01 1E 40 54 4C 6F 76 65 72 73 28 4C NG....@TLovers(L
00000060  69 76 65 20 61 20 4C 69 74 74 6C 65 20 4C 6F 6E ive a Little Lon
00000070  67 65 72 29 00 FF 01 02 40 54 8A 5B FF 01 02 5C ger)....@T.[...\
00000080  53 06 FF 01 01 69 06 FF 01 01 74 09 FF 01 01 20 S....i....t....
00000090  06 FF 01 01 64 06 FF 01 01 6F 06 FF 01 01 77 06 ....d....o....w.
000000A0  FF 01 01 6E 07 FF 01 01 20 07 FF 01 01 61 07 FF ...n.... ....a..
000000B0  01 01 6E 07 FF 01 01 64 0A FF 01 01 20 07 FF 01 ..n....d.... ...
000000C0  01 6C 07 FF 01 01 69 07 FF 01 01 73 0A FF 01 01 .l....i....s....
000000D0  74 0A FF 01 01 65 0A FF 01 01 6E 00 FF 01 01 5C t....e....n....\
      

and this is a conforming MIDI file.

Just looking at the lyric part, MIDI files require each piece of lyric text to be wrapped in a text meta event

<delta> FF 01 <string length> <string>
      

It is likely that the Malata just has

<delta>  single-char
      

That's easy to adjust - if we know where the lyrics stop!

The lyrics start after a sequence "18 07 18 07 18 07 18 07 00 26".
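To make the adjustment concrete, here is a sketch that walks the (delta, character) pairs of the first line of "Lovers", copied from the dump at 0xC0, and rewrites each pair as a one-character MIDI text event. Several things are assumed: that the first byte of each pair really is a delta, that it can be used unchanged as a MIDI delta time (which is only valid for values below 0x80, and the tick units are unknown), and that '^' ends the line. The class name LyricPairs and its methods are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class LyricPairs {
    // Bytes following "00 26" in the Lovers dump, up to and including '^'.
    static final int[] LINE = {
        0x01, 0x5C, 0x00, 0x53, 0x04, 0x69, 0x04, 0x74, 0x06, 0x20,
        0x00, 0x64, 0x07, 0x6F, 0x05, 0x77, 0x05, 0x6E, 0x07, 0x20,
        0x00, 0x61, 0x09, 0x6E, 0x07, 0x64, 0x09, 0x20, 0x00, 0x6C,
        0x0B, 0x69, 0x09, 0x73, 0x08, 0x74, 0x08, 0x65, 0x09, 0x6E,
        0x08, 0x5E
    };

    // Collect the second byte of every pair until the '^' marker.
    static String lineText(int[] data) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 1 < data.length; i += 2) {
            if (data[i + 1] == '^') break;
            sb.append((char) data[i + 1]);
        }
        return sb.toString();
    }

    // Rewrite each pair as: delta, FF 01 (text meta event), length 1, character.
    static List<Integer> toMidiTextEvents(int[] data) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i + 1 < data.length; i += 2) {
            int delta = data[i];
            int ch = data[i + 1];
            if (ch == '^') break;
            out.add(delta);
            out.add(0xFF);
            out.add(0x01);
            out.add(0x01);
            out.add(ch);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(lineText(LINE));
        for (int b : toMidiTextEvents(LINE)) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
    }
}
```

The leading backslash in the extracted text is the same '\' that appears before "S" in the Sonken MIDI file.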

A much trickier problem is that the lyrics are not contiguous! They should be

Sit down and listen
'cause I've got
good news for you
It was in the
papers today
      

But if we look at the file, lines 2 and 4 are missing:

000000C0  01 5C 00 53 04 69 04 74 06 20 00 64 07 6F 05 77 .\.S.i.t. .d.o.w
000000D0  05 6E 07 20 00 61 09 6E 07 64 09 20 00 6C 0B 69 .n. .a.n.d. .l.i
000000E0  09 73 08 74 08 65 09 6E 08 5E 01 00 1D 01 1E 26 .s.t.e.n.^.....&
000000F0  01 67 05 6F 05 6F 05 64 07 20 00 6E 08 65 05 77 .g.o.o.d. .n.e.w
00000100  04 73 07 20 00 66 09 6F 07 72 09 20 00 79 09 6F .s. .f.o.r. .y.o
00000110  05 75 06 5E 01 00 10 01 33 26 01 70 0D 61 0C 70 .u.^....3&.p.a.p
      

For the missing line 2, there is some sort of pointer "09 6E 08 5E 01 00 1D 01 1E 26", and it turns out that the missing lines are elsewhere:

00000790  68 19 5E 01 00 81 C9 1A FF 85 21 26 02 27 02 63 h.^.......!&.'.c
000007A0  03 61 02 75 03 73 02 65 04 20 00 49 04 27 04 76 .a.u.s.e. .I.'.v
000007B0  03 65 05 20 00 67 0A 6F 08 74 08 5E 02 00 10 02 .e. .g.o.t.^....
000007C0  4C 26 02 49 06 74 08 20 00 77 09 61 07 73 09 20 L&.I.t. .w.a.s.
000007D0  00 69 07 6E 06 20 00 74 07 68 04 65 04 5E 02 00 .i.n. .t.h.e.^..
      

The Malata displays two lines at a time. All of the first lines form a chunk. The second lines form another chunk, later on. I haven't found an offset or length to say where this second chunk occurs. Each line appears to consist of a delta value (I guess) for the delay of the lyric, followed by a lyric character.
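Once both chunks have been located, putting the lyrics back in reading order is just a matter of alternating between them. A minimal sketch, assuming the two chunks have already been split into lines (the class name TwoChunkLyrics and the method interleave are made up here, and the lines are the ones from "Lovers" shown above):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TwoChunkLyrics {
    // Alternate lines from the first chunk and the second chunk.
    // The chunks may differ in length by one line, as in the example.
    static List<String> interleave(List<String> first, List<String> second) {
        List<String> out = new ArrayList<>();
        int n = Math.max(first.size(), second.size());
        for (int i = 0; i < n; i++) {
            if (i < first.size()) out.add(first.get(i));
            if (i < second.size()) out.add(second.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> first = Arrays.asList(
                "Sit down and listen", "good news for you", "papers today");
        List<String> second = Arrays.asList(
                "'cause I've got", "It was in the");
        for (String line : interleave(first, second)) {
            System.out.println(line);
        }
    }
}
```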

Between lines is a section that starts with the character '^' and finishes with the character '&'. I haven't any idea what is in these sections apart from two observations:

The file printLyrics.c gives a dump of the lyrics for songs like Lovers (non-coded lyrics), plus the (assumed) deltas and the stuff between lines.

#include <stdio.h>
#include <strings.h>

#define NUM_LINES 100
#define LINE_LEN 25

typedef enum {BEFORE_SONG, IN_SONG_AND_LINE, IN_SONG_BETWEEN} state_t;

state_t state = BEFORE_SONG;
int half = 1;

FILE *fp;
FILE *ofp;

unsigned char prev_ch, curr_ch;
unsigned char lines[NUM_LINES][LINE_LEN];
unsigned char separator[NUM_LINES][LINE_LEN];
unsigned char deltas[NUM_LINES][LINE_LEN];

unsigned char *first_lines, 
    *second_lines,
    *first_separators,
    *second_separators,
    *first_deltas,
    *second_deltas;

int max_lines;

void read1() {
    prev_ch = curr_ch;
    curr_ch = getc(fp);
    // fprintf(ofp, "%X %X\n", prev_ch, curr_ch);
}

void read2() {
  prev_ch = getc(fp);  
  curr_ch = getc(fp);
    // fprintf(ofp, "%X %X\n", prev_ch, curr_ch);
}


void printLine(unsigned char *line) {
    fprintf(ofp, "%2d: ", (int) ((line - lines[0])/LINE_LEN));
    int m = 0;
    for (m = 0; m < LINE_LEN; m++) {
	if (line[m] != 0) {
	    putc(line[m], ofp);
	} else {
	    putc(' ', ofp);
	}
    }
}

void printDeltas(unsigned char *deltas) {
    int m, max;

    for (max = LINE_LEN-1; max >= 0; max--) {
	if (deltas[max] != 0)
	    break;
    }
	       
    for (m = 0; m <= max; m++) {
	fprintf(ofp, "%2X ", deltas[m]);
    }

    for (m = max+1; m < LINE_LEN; m++) {
	fprintf(ofp, "   ");
    }	
}

void printSeparator(unsigned char *sep) {
    int m, max;

    for (max = LINE_LEN-1; max >= 0; max--) {
	if (sep[max] != 0)
	    break;
    }
	       
    for (m = 0; m <= max; m++) {
	fprintf(ofp, "%2X ", sep[m]);
    }
    
}

int endSection(unsigned char *sep) {
    // does this sep end with 00 26, a break in the music?
    int m, max;

    for (max = LINE_LEN-1; max >= 0; max--) {
	if (sep[max] != 0)
	    break;
    }
	       
    if ((max >= 1) && (sep[max] == 0x26) && (sep[max-1] == 0))
	return 1;
    return 0;
}

int main(int argc, char **argv) {
    int inSong = 0;
    // int lineNo = 0;
    int inLine = 0;

    bzero(lines, NUM_LINES*LINE_LEN* sizeof(unsigned char));
    bzero(separator, NUM_LINES*LINE_LEN* sizeof(unsigned char));
    bzero(deltas, NUM_LINES*LINE_LEN* sizeof(unsigned char));

    int lineNo = 0;
    int charNo = 0;

    if (argc == 1) {
	fp = stdin;
	ofp = stdout;
    } else if (argc == 2) {
	fp = fopen(argv[1], "r");
	ofp = stdout;
    } else {
	fp = fopen(argv[1], "r");
	ofp = fopen(argv[2], "w");
    }

    int n = 0;
    
    while ((n++ < 6000) && (half <= 2)) {
	
	switch (state) {
	case BEFORE_SONG:
	    read2();
	    if ((prev_ch == 0) && (curr_ch == '&')) {
		state = IN_SONG_AND_LINE;
	    }
	    break;
	case IN_SONG_AND_LINE:
	    read2();
	    if (curr_ch == '^') {
		state = IN_SONG_BETWEEN;
		putc('\n', ofp);

		fprintf(ofp, "%d: ", lineNo);

		charNo = 0;
		separator[lineNo][charNo++] = prev_ch;
		separator[lineNo][charNo++] = curr_ch;


	    } else {
		if (curr_ch != 0) {
		    putc(curr_ch, ofp);
		    lines[lineNo][charNo] = curr_ch;
		    deltas[lineNo][charNo] = prev_ch;
		    charNo++;
		}
	    }
	    break;
	case IN_SONG_BETWEEN:
	    read1();
	    if (curr_ch == '&') {
		state = IN_SONG_AND_LINE;

		separator[lineNo][charNo++] = curr_ch;

		charNo = 0;
		lineNo += 1;
	    }  else if (curr_ch == 0xFF) {
		if (half == 1) {
		    // the 2nd half
		    fprintf(ofp, "\n\nStarting 2nd half\n\n");
		    // discard extra
		    //getc(fp);
		    //lineNo = -1;
		    max_lines = lineNo;
		    half = 2;
		} else {
		    half = 3;
		}
	    } else {
		separator[lineNo][charNo++] = curr_ch;
	    }
	    break;
	}
    }

    fprintf(ofp, "\n\nDumping lines\n\n");
    
    first_lines = lines[0];
    second_lines = lines[0] + (max_lines+1) * LINE_LEN;

    first_deltas = deltas[0];
    second_deltas = deltas[0] + (max_lines+1) * LINE_LEN;

    first_separators = separator[0];
    second_separators = separator[0] + (max_lines+1) * LINE_LEN;

    for (n = 0; n < max_lines; n++) {
	fprintf(ofp, "%2d: ", n);
	/*
	if (lines[n][0] == 0) {
	    break;
	}
	*/
	//printLine(lines[n]);
	printLine(first_lines);
	first_lines += LINE_LEN;

	fprintf(ofp, "\n   ");

	//printDeltas(deltas[n]);
	printDeltas(first_deltas);
	first_deltas += LINE_LEN;

	//printSeparator(separator[n]);
	printSeparator(first_separators);
	first_separators += LINE_LEN;
	putc('\n', ofp);

	if (endSection(first_separators - LINE_LEN)) {
	    fprintf(ofp, "Break occurring in first line\n");
	    //continue;
	}

	fprintf(ofp, "%2d: ", n + max_lines + 1);
	//printLine(lines[n + max_lines + 1]);
	printLine(second_lines);
	second_lines += LINE_LEN;


	fprintf(ofp, "\n   ");
	//printDeltas(deltas[n + max_lines + 1]);
	printDeltas(second_deltas);
	second_deltas += LINE_LEN;

	//printSeparator(separator[n + max_lines + 1]);
	printSeparator(second_separators);
	second_separators += LINE_LEN;

	if (endSection(second_separators - LINE_LEN)) {
	    fprintf(ofp, "Break occurring in second line\n");
	    //continue;
	}

	/*
	for (m = 0; m < LINE_LEN/2; m++) {
	    fprintf(ofp, "%2X ", separator[n][m]);
	}
	*/
	putc('\n', ofp);
    }
    

    fclose(ofp);
    exit(0);
}


      

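It prints output like the following. Each line number appears twice (once from the loop in main and once from printLine), followed by the lyric text. The row underneath shows the delta bytes in hex, followed by the separator bytes: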

 0:  0: \Sit down and listen     
    1  0  4  4  6  0  7  5  5  7  0  9  7  9  0  B  9  8  8  9                 8 5E  1  0 1D  1 1E 26 
47: 47: 'cause I've got          
    2  2  3  2  3  2  4  0  4  4  3  5  0  A  8                                8 5E  2  0 10  2 4C 26 
 1:  1: good news for you        
    1  5  5  5  7  0  8  5  4  7  0  9  7  9  0  9  5                          6 5E  1  0 10  1 33 26 
48: 48: It was in the            
    2  6  8  0  9  7  9  0  7  6  0  7  4                                      4 5E  2  0 10  2 80 C8 26 
 2:  2: papers today             
    1  D  C  5  5  5  7  0 14 10 11 10                                        11 5E  1  0 10  1 80 A7 26 
49: 49: Some physician           
    2  5  5  5  7  0  B  9  8  D  C  6  7  6                                   6 5E  2  0 10  2 80 96 26 
 3:  3: had made a discovery     
    1  4  5  6  0  5  3  4  4  0 1B  0  E  6  5  8  9  8 11  8                 4 5E  1  0 10  1 3F 26 
      

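That's the data. Each separator row contains 5E ('^') and ends in 26 ('&'), the two markers the parser keys on, but I'm not yet sure what information the delta and separator bytes convey.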

Conclusion

The Malata uses a more complex encoding than the Sonken. This is only partially solved at present.


Copyright © Jan Newmarch, jan@newmarch.name
Creative Commons License
"Programming and Using Linux Sound - in depth" by Jan Newmarch is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://jan.newmarch.name/LinuxSound/.
