Adding Pinyin to Chinese Lyrics on the RPi

Resources

Pinyin

The Chinese written language uses characters (logograms) such as '好', meaning 'good'. There are over 20,000 of these, but only about 5,000 are in regular use. This is obviously far too many for any keyboard, let alone a mobile phone pad. Each character also has a romanised form, called Pinyin, written in the Latin alphabet with accents that mark the tone. The Pinyin form of '好' is 'hǎo'. A dictionary such as Chinese-English Dictionary will show these.

A Chinese song will use Chinese characters. I don't recognise them and so can't sing along to the lyrics. I need the Pinyin, so my Karaoke program has to convert the Chinese into Pinyin and display that along with the Chinese.

Mapping Chinese characters to Pinyin

I couldn't find a ready-made list of characters and their corresponding Pinyin. The closest is the Chinese-English Dictionary, which can be downloaded as a text file. Typical lines in this file are

	 
不賴 不赖 [bu4 lai4] /not bad/good/fine/
	 
       

Each line has the Traditional characters followed by the Simplified characters, the Pinyin in [...] with each tone written as a trailing digit, and then the English meanings separated by '/'.
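
The line format above can be pulled apart with a single sscanf call. This is only a minimal sketch: the struct, its field sizes and the function name parse_cedict_line are my own inventions for illustration, not part of the dictionary or of the Karaoke program.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One parsed dictionary entry; the field sizes are arbitrary for this sketch. */
struct cedict_entry {
    char traditional[64];
    char simplified[64];
    char pinyin[128];
    char english[256];
};

/* Parse "TRAD SIMP [pin1 yin1] /meaning/.../" into its four fields.
 * Returns 1 on success, 0 if the line does not have that shape
 * (for example a '#' comment line). */
int parse_cedict_line(const char *line, struct cedict_entry *e)
{
    assert(line != NULL && e != NULL);
    /* %63s stops at whitespace, %127[^]] reads up to the closing bracket,
     * %255[^\n] takes the rest of the line. */
    int n = sscanf(line, "%63s %63s [%127[^]]] %255[^\n]",
                   e->traditional, e->simplified, e->pinyin, e->english);
    return n == 4;
}
```

Given the sample line shown earlier, this should leave "不赖" in simplified and "bu4 lai4" in pinyin.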

I used the following shell script to make a list of character/Pinyin pairs:

	
#!/bin/bash

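# Get character + Pinyin pairs by throwing away everything else in the dictionary:
#   awk          keeps the Simplified field and the first Pinyin syllable
#   grep -v ...  drops entries containing capitals, then multi-character words
#   sed          strips the [ ] brackets and the trailing tone digit
#   sort | uniq  keeps one reading per character

awk '{print $2, $3}' cedict_ts.u8 | grep -v '[A-Z]' |
  grep -v '^.[^ ]' | sed -e 's/\[//' -e 's/\]//' -e 's/[0-9]$//' |
    sort | uniq -w 1 > pinyinmap.txt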
	
      
to give lines such as the following (the tone digit has been stripped, so 'hǎo' appears as plain 'hao')
	
好 hao
妁 shuo
如 ru
妃 fei
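
For anyone who would rather do the filtering inside a C program, the per-entry part of the shell pipeline can be sketched roughly as below. The names utf8_seq_len and make_pair are hypothetical, and the sketch only approximates the script: it does not do the final sort and uniq step that keeps one reading per character.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Number of bytes in the UTF-8 sequence that starts with lead byte c
 * (0 for a byte that cannot start a sequence). */
static int utf8_seq_len(unsigned char c)
{
    if (c < 0x80) return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 0;
}

/* Accept only a single-character word whose Pinyin has no capitals,
 * and drop the trailing tone digit. Writes "CHAR pinyin" to out and
 * returns 1, or returns 0 if the entry is rejected or out is too small. */
int make_pair(const char *simplified, const char *pinyin,
              char *out, size_t outsz)
{
    assert(simplified && pinyin && out);
    int n = utf8_seq_len((unsigned char)simplified[0]);
    if (n == 0 || strlen(simplified) != (size_t)n)
        return 0;                       /* not exactly one character */
    for (const char *p = pinyin; *p; p++)
        if (isupper((unsigned char)*p))
            return 0;                   /* capitalised, e.g. a proper name */
    size_t len = strlen(pinyin);
    if (len > 0 && isdigit((unsigned char)pinyin[len - 1]))
        len--;                          /* strip the tone number */
    int w = snprintf(out, outsz, "%s %.*s", simplified, (int)len, pinyin);
    return w > 0 && (size_t)w < outsz;
}
```

Calling make_pair("好", "hao3", buf, sizeof buf) should fill buf with "好 hao", while a two-character word such as "不赖" is rejected.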
	
      

Building a map

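Text files under Linux are now usually encoded in UTF-8, where a Chinese character takes several bytes. For the map we want each character as a single fixed-width UCS value (its Unicode code point). I use the UTF-8 library by Jeff Bezanson, from Unicode in C and C++: What You Can Do About It Today, to do the conversion and build a GLib hash table as follows: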

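#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <glib.h>
#include <VG/openvg.h>   // for VGuint
#include "utf8.h"        // Jeff Bezanson's UTF-8 library (header name assumed)

// pinyin_map is a global GHashTable * declared elsewhere in the program

void fill_pinyin_map() {
    // pinyinmap.txt is a UTF-8 file with
    // 1st char the Chinese char
    // 2nd char a space
    // rest is the Pinyin for the char, all ASCII chars
    // Don't convert the rest to UCS, leave as ASCII and
    // convert as needed when used
    FILE *fp = fopen("pinyinmap.txt", "r");
    if (fp == NULL) {
	fprintf(stderr, "Can't find pinyinmap.txt\n");
	exit(1);
    }

    pinyin_map = g_hash_table_new(g_direct_hash, g_direct_equal);

    char line[512];
    // u8_toucs() writes a terminating 0 after the converted characters,
    // so leave room for two elements even though only one is wanted
    VGuint ucs_chars[2];
    int len, ch_len;
    while (fgets(line, sizeof(line), fp) != NULL) {

	len = strlen(line);
	if (len > 0 && line[len-1] == '\n') {
	    line[--len] = '\0'; // lose '\n'
	}
	ch_len = u8_offset(line, 1); // byte length of the 1st char
	if (len < ch_len + 2) {
	    continue; // blank or malformed line: no space + Pinyin after the char
	}

	// just get 1st char from UTF-8 to UCS
	u8_toucs(ucs_chars, 2,
		 line,
		 ch_len);
	gchar *pinyin_chs = g_strdup(line + ch_len + 1); // lose space separator
	g_hash_table_insert(pinyin_map,
			    GINT_TO_POINTER(ucs_chars[0]),
			    pinyin_chs);
    }
    fclose(fp);
}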
      
Characters can then be looked up in this map and drawn appropriately.
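
To show the lookup step without needing GLib or the OpenVG drawing code, here is a self-contained sketch. It assumes a tiny hard-coded table (sample_map) standing in for the hash table, a hand-written UTF-8 decoder in place of Bezanson's library, and a hypothetical function lyric_to_pinyin; none of these names come from the real program.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Tiny stand-in for the hash table: code point -> toneless Pinyin. */
struct pinyin_entry { uint32_t ucs; const char *pinyin; };
static const struct pinyin_entry sample_map[] = {
    { 0x597D, "hao" },  /* 好 */
    { 0x5982, "ru"  },  /* 如 */
    { 0x5983, "fei" },  /* 妃 */
};

/* Decode one UTF-8 sequence at s into *ucs; returns the bytes consumed,
 * or 0 at end of string or on a malformed sequence. */
static int utf8_decode(const char *s, uint32_t *ucs)
{
    const unsigned char *p = (const unsigned char *)s;
    int n;
    if (p[0] == 0) return 0;
    if (p[0] < 0x80)                { *ucs = p[0];        n = 1; }
    else if ((p[0] & 0xE0) == 0xC0) { *ucs = p[0] & 0x1F; n = 2; }
    else if ((p[0] & 0xF0) == 0xE0) { *ucs = p[0] & 0x0F; n = 3; }
    else if ((p[0] & 0xF8) == 0xF0) { *ucs = p[0] & 0x07; n = 4; }
    else return 0;
    for (int i = 1; i < n; i++) {
        if ((p[i] & 0xC0) != 0x80) return 0;  /* not a continuation byte */
        *ucs = (*ucs << 6) | (p[i] & 0x3F);
    }
    return n;
}

/* Linear search is enough for a three-entry table. */
static const char *lookup_pinyin(uint32_t ucs)
{
    for (size_t i = 0; i < sizeof sample_map / sizeof sample_map[0]; i++)
        if (sample_map[i].ucs == ucs) return sample_map[i].pinyin;
    return NULL;
}

/* Write the Pinyin for each character of a UTF-8 lyric into out,
 * separated by spaces; characters with no entry are copied unchanged. */
void lyric_to_pinyin(const char *lyric, char *out, size_t outsz)
{
    assert(lyric && out && outsz > 0);
    size_t used = 0;
    uint32_t ucs;
    int n;
    out[0] = '\0';
    while ((n = utf8_decode(lyric, &ucs)) > 0) {
        const char *py = lookup_pinyin(ucs);
        int w = snprintf(out + used, outsz - used, "%s%.*s",
                         used ? " " : "",
                         py ? (int)strlen(py) : n, py ? py : lyric);
        if (w < 0 || (size_t)w >= outsz - used) {
            out[used] = '\0';   /* out is full: drop the partial piece */
            break;
        }
        used += (size_t)w;
        lyric += n;
    }
}
```

With this table, lyric_to_pinyin("好如", buf, sizeof buf) should leave "hao ru" in buf. In the real program the lookup would instead be g_hash_table_lookup on the map built above, and the result would be drawn rather than copied into a string.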