StarDict is an awesome dictionary utility for Linux, with several dictionaries available for Mandarin.
I started to learn Cantonese recently, and I couldn't find any StarDict dictionaries for Cantonese.
There is a plugin for Firefox called CantoFish, which uses dictionary data from CantoDict and adsotrans.
Unfortunately, for some reason CantoFish doesn't run on my machine.
Also, StarDict's mouse-over translation is really awesome.
So I took a look at converting CantoFish's dictionary into StarDict format, which turned out to be pretty easy.
CantoFish's dictionary is in the Firefox profile, in extensions/cantofish@cantofish.net/chrome/content/canto.dat
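If you are not sure which profile holds the file, a small search can find it. This is only a sketch: the relative path is the one quoted above, but the default root ~/.mozilla/firefox and the helper name find_canto_dat are my assumptions, so adjust them for your setup.

```python
import glob
import os

# Relative path quoted from the post; the rest of this sketch is an assumption.
CANTO_DAT = os.path.join("extensions", "cantofish@cantofish.net",
                         "chrome", "content", "canto.dat")


def find_canto_dat(firefox_root="~/.mozilla/firefox"):
    """Return every canto.dat found in the profiles under firefox_root."""
    root = os.path.expanduser(firefox_root)
    # Each profile is assumed to be a direct subdirectory of the root,
    # so a single '*' matches the profile directory name.
    return sorted(glob.glob(os.path.join(root, "*", CANTO_DAT)))


if __name__ == "__main__":
    for path in find_canto_dat():
        print(path)
```

An empty result just means no profile under that root has CantoFish installed.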
As far as I know, this data is available under the GPL, or possibly under a non-commercial attribution license (e.g. adso).
canto.dat is a tab-separated text file, which is really easy to read. I was pleasantly surprised by this!
StarDict dictionaries can then be created from an appropriate tab-separated file using tabfile, which is in stardict-tools.
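To make the tabfile input format concrete, here is a minimal sketch of building one entry: a headword, a tab, then the definition on the same physical line, with line breaks inside the definition written as the two characters backslash and n. The helper name tabfile_line and the sample entry are made up for illustration.

```python
def tabfile_line(headword, definition_lines):
    """Build one line of tabfile input: headword, a tab, then the definition."""
    # The definition must stay on one physical line, so each line break
    # is written as a literal backslash followed by 'n'.
    definition = "\\n".join(definition_lines)
    return headword + "\t" + definition + "\n"


# Made-up sample entry, shown with repr() so the tab and escapes are visible.
print(repr(tabfile_line("你好", ["nei5 hou2", "hello", "hi"])))
```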
I used the following Python 2 script to convert canto.dat into an input file for tabfile; after that, the rest is easy:
#!/usr/bin/python
import sys
import string

def go( cantopath, outpath ):
    print cantopath
    cantofd = open( cantopath, "r" )
    outfd = open( outpath, "w")
    firstline = True
    for line in cantofd.readlines():
        # The first line of canto.dat is skipped
        if not firstline:
            line = line.strip()
            # Characters: everything before the first space
            cantocharacters = line.split(" ")[0]
            # Pronunciation: contents of the first [...] group
            cantopronunciation = line.split("[")[1].split("]")[0].strip()
            # Translation: text after the second "]"; each "/" becomes a literal \n for tabfile
            trans = string.join( line.split("]")[2:],' ').replace('/', '\n').strip().replace('\n', '\\n')
            outfd.write( cantocharacters + '\t' + cantopronunciation + '\\n' + trans + '\n' )
        firstline = False
    outfd.close()
    cantofd.close()

go( sys.argv[1], sys.argv[2] )
The script expects the path of canto.dat as the first argument, and the name of the output file as the second.
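The script above is written for Python 2 (print statement, string.join). For a machine that only has Python 3, a rough equivalent might look like the sketch below. It keeps the same parsing steps. The UTF-8 encoding is my assumption about canto.dat, and the sample entry is made up to match what the parsing expects (characters, then two bracketed groups, then slash-separated translations), not copied from the real file.

```python
def convert_line(line):
    """Turn one canto.dat entry into one line of tabfile input."""
    line = line.strip()
    # Characters: everything before the first space.
    characters = line.split(" ")[0]
    # Pronunciation: contents of the first [...] group.
    pronunciation = line.split("[")[1].split("]")[0].strip()
    # Translation: text after the second ']', with '/' turned into line breaks,
    # which are then written as a literal backslash + 'n' for tabfile.
    trans = " ".join(line.split("]")[2:]).replace("/", "\n").strip()
    trans = trans.replace("\n", "\\n")
    return characters + "\t" + pronunciation + "\\n" + trans + "\n"


def convert(cantopath, outpath):
    """Convert a whole canto.dat file, skipping its first line."""
    # encoding="utf-8" is an assumption about how canto.dat is stored.
    with open(cantopath, encoding="utf-8") as src, \
            open(outpath, "w", encoding="utf-8") as dst:
        next(src, None)  # skip the first line, as the original script does
        for line in src:
            dst.write(convert_line(line))


# Made-up entry in the shape the parsing expects.
sample = "你好 [nei5 hou2] [ni3 hao3] /hello/hi/"
print(repr(convert_line(sample)))
```

Calling convert() with the path of canto.dat and the name of the output file does the same job as the two command-line arguments of the original script.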
Then you can just process the output using tabfile, and copy the resulting files into an appropriate subfolder of /usr/share/stardict/dic.
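The copy step can be scripted too. The sketch below assumes that the tabfile input was named with a .txt extension and that tabfile wrote its output files next to it with the same base name. I have not checked the exact set of files it produces, so the helper simply copies every sibling file except the .txt input. The name install_dictionary is made up.

```python
import glob
import os
import shutil


def install_dictionary(base, dest_dir):
    """Copy the files generated for `base` (path without extension) into dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    copied = []
    # Assumption: the generated files share the input's base name.
    for path in sorted(glob.glob(glob.escape(base) + ".*")):
        if path.endswith(".txt"):
            continue  # skip the tabfile input itself
        shutil.copy(path, dest_dir)
        copied.append(os.path.basename(path))
    return copied
```

Here dest_dir would be the StarDict dictionary subfolder mentioned above; writing there normally needs root.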