Archive for April, 2010

Converting CantoFish dictionary for stardict

Sunday, April 11th, 2010

Stardict is an awesome dictionary utility for Linux, with several dictionaries available for Mandarin.

I started to learn Cantonese recently, and I couldn’t find any stardict dictionaries for Cantonese.

There is a plugin for Firefox called CantoFish, which uses dictionary data from CantoDict and adsotrans.

Unfortunately, for some reason CantoFish doesn’t run on my machine.

And also, stardict’s mouse-over translation is really awesome.

So I took a look at converting CantoFish’s dictionary into stardict format, which turned out to be pretty easy.

CantoFish’s dictionary is in the Firefox profile, in extensions/cantofish@cantofish.net/chrome/content/canto.dat

As far as I know, this data is available under the GPL, or possibly under a non-commercial attribution license (e.g. the adso data).

canto.dat is a tab-separated text file, which is really easy to read. I was pleasantly surprised by this!

Stardict dictionaries can then be created from an appropriate tab-separated file using tabfile, which is in stardict-tools.

I used the following script to convert canto.dat into an input file for tabfile, and then the rest is easy:

#!/usr/bin/python
# Python 2 script: converts canto.dat into an input file for tabfile

import sys
import string

def go( cantopath, outpath ):
	print cantopath
	cantofd = open( cantopath, "r" )
	outfd = open( outpath, "w")
	firstline = True
	for line in cantofd.readlines():
		# the first line of canto.dat is skipped
		if not firstline:
			line = line.strip()
			# headword: everything before the first space
			cantocharacters = line.split(" ")[0]
			# pronunciation: the contents of the first [...] group
			cantopronunciation = line.split("[")[1].split("]")[0].strip()
			# definitions: the text after the second "]", with each '/' turned into a literal \n
			trans = string.join( line.split("]")[2:],' ').replace('/', '\n').strip().replace('\n', '\\n')
			# one entry per line: headword, tab, pronunciation, then the definitions
			outfd.write( cantocharacters + '\t' + cantopronunciation + '\\n' + trans + '\n' )
		firstline = False
	outfd.close()
	cantofd.close()

go( sys.argv[1], sys.argv[2] )

The script expects the path of canto.dat as the first argument, and the name of the output file as the second.

Then you can just process the output using tabfile, and copy the resulting files into an appropriate subfolder of /usr/share/stardict/dic/dic.
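To see what the script does to a single entry, here is a rough Python 3 sketch of the same per-line conversion. The function name convert_line and the sample line are made up for illustration; the sample only assumes the layout that the script's splitting implies (headword, two bracketed groups, then definitions separated by slashes), and is not copied from canto.dat.

```python
# Python 3 sketch of the per-line conversion done by the script above.
# SAMPLE_LINE is a hypothetical entry, not real canto.dat content.

def convert_line(line):
    """Turn one dictionary line into one line of tabfile input (no trailing newline)."""
    line = line.strip()
    # Headword: everything before the first space.
    characters = line.split(" ")[0]
    # Pronunciation: contents of the first [...] group.
    pronunciation = line.split("[")[1].split("]")[0].strip()
    # Definitions: everything after the second "]".
    trans = " ".join(line.split("]")[2:])
    # '/' separates definitions; each separator becomes a literal backslash-n.
    trans = trans.replace("/", "\n").strip().replace("\n", "\\n")
    return characters + "\t" + pronunciation + "\\n" + trans


SAMPLE_LINE = "你好 [nei5 hou2] [ni3 hao3] /hello/hi/"
```

Calling convert_line(SAMPLE_LINE) gives the headword, a tab, and then the pronunciation and definitions joined by literal \n sequences, all on one physical line, which is the shape of file the original script writes for tabfile.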

Finally, Karmic kernel no longer panics on my eeepc

Thursday, April 8th, 2010

Finally, nearly six months after release, the latest Karmic kernel has not panicked in over a week now. Earlier kernels generally panicked whenever I turned off the rt2860sta wifi on my eeepc 901, or at least every few times I did so. It seems like the 2.6.31-20-generic kernel has the relevant patches applied to it.

Created a test web page for my web site

Monday, April 5th, 2010

To make it easy to check whether my website is functioning, I built a Python web page that downloads each sub-site, checks that it contains some appropriate text, and displays the result.

The test page is here:

web site test page

The source code is something like:

import urllib

def checksite(sitename, url, checktext):
   # download the page and decode it, assuming the site serves utf-8
   serverrequesthandle = urllib.urlopen(url, None )
   serverrequestarray = serverrequesthandle.readlines()
   serverrequeststring = (''.join( serverrequestarray )).decode('utf-8')
   # report OK if the expected text appears anywhere in the page
   result = sitename + u': '
   if serverrequeststring.find(checktext) > -1:
      result = result + '<font color="green">OK</font>'
   else:
      result = result + '<font color="red">FAIL</font>'
   return result

   result = u""

   try:
      result = result + checksite(u'techblog', "http://hughperkins.com/techblog", u'Hugh Perkins') + u'<br />'
      result = result + checksite(u'writerblog', "http://manageddreams.com/writerblog", u'Hugh Perkins') + u'<br />'
      # etc ...
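For anyone on Python 3, the same check could be sketched roughly as below. This is not the code running on the test page: the function names fetch_page, check_text and check_site are made up here, the 10 second timeout is an arbitrary choice, and the sketch assumes every site serves utf-8. The download is kept separate from the text check so the check can be tried without a network connection.

```python
import urllib.request


def fetch_page(url):
    """Download a page and decode it as UTF-8 (assumes the site serves UTF-8)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")


def check_text(sitename, page_text, checktext):
    """Return an HTML fragment saying whether checktext occurs in page_text."""
    if checktext in page_text:
        status = '<font color="green">OK</font>'
    else:
        status = '<font color="red">FAIL</font>'
    return sitename + ": " + status


def check_site(sitename, url, checktext):
    """Fetch url and check it; a failed download is reported as FAIL."""
    try:
        page_text = fetch_page(url)
    except (OSError, UnicodeDecodeError):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        page_text = ""
    return check_text(sitename, page_text, checktext)
```

Unlike the snippet above, a site that is down shows up as FAIL here rather than raising an exception, which seems like the more useful behaviour for a status page.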

Hmmm, idea: maybe I could wrap this up in a website itself, and let people register their own sites in it? I guess this probably exists somewhere, but maybe for a ridiculous price, so there could be an opportunity to make something similar, but ad-supported?

Python, jinja, print and unicode…

Monday, April 5th, 2010

This took me a while to figure out…

I wanted to display some Chinese characters in Jinja, in Python.

The problem was I would get errors like:

  • UnicodeDecodeError: 'ascii' codec can't decode byte
  • UnicodeEncodeError: 'ascii' codec can't encode characters in position

I figured it was going to be one of those errors which needs lots of reading to figure out the answer, but, once you know the answer, the actual tweaks to the code are minimal; and that was the case for me.

Ultimately, this page from diveintopython.org turned out to be pretty helpful for me.

First, what was I doing? I had something like the following code:

sometext = gettextfromsomewhere() # eg by downloading a web-page
env = jinja2.Environment(loader=jinja2.PackageLoader('jinjaapplication', 'templates'))
template = env.get_template('mypage.html')
print template.render( sometext = sometext )

sometext was the result of using urllib to download a webpage. The webpage contained Chinese characters, encoded as utf-8.

Logically I felt sometext would also be utf-8 encoded unicode, which is half-right: it was utf-8 encoded, but as a plain str of bytes rather than as a unicode object.

I tried all sorts of things like converting sometext to unicode using unicode(), decode, encode and so on, but it still would not work.

Ultimately, I feel there were two key concepts I needed in order to fix the problem:

  • print doesn’t seem to be unicode-aware
  • decode and encode; str vs unicode

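A biggie for me was that print does not seem to be unicode-aware. print seems to be geared up simply for outputting bytes, byte by byte. (When the output is not going to a terminal, Python 2 falls back to the ascii codec for unicode objects, which is where the UnicodeEncodeError comes from.)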

print expects not a unicode object, but a plain old ‘str’, containing the unicode characters converted into plain old bytes.

Converting a unicode string into an appropriate str is easy once one knows how, but fairly counter-intuitive I felt. It can be done like this:

print someunicodestring.encode('utf-8')

I felt calling this ‘encode’ was counter-intuitive, since I thought of the unicode string as already being utf-8. A better way of thinking about it is that a unicode object holds characters with no particular byte encoding, and encode turns those characters into utf-8 coded bytes.

This was half the solution, and the biggest part for me.

The other half is figuring out how to deal with the incoming string from the downloaded website. The incoming string is a ‘str’, and it actually contains the Chinese characters in utf-8, only they are being stored byte by byte as a str, rather than stored character by character as a unicode object.

Converting from the utf-8 str to a unicode object is the reverse of what we did for printing. We can simply do:

sometext = gettextfromsomewhere().decode('utf-8')

The ‘decode’ function takes the str, which is a byte array holding utf-8 encoded text, and converts it into a unicode object, i.e. a sequence where each value is a single character. I think. Anyway, it works :-P
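The same round trip can be tried out in Python 3, where the names have shifted: Python 2's str corresponds to bytes, and Python 2's unicode corresponds to str. The variable names below are just for illustration. The sketch also reproduces both of the ascii errors from the top of this post.

```python
text = "中文"                # characters: Python 3 str (Python 2: unicode)
raw = text.encode("utf-8")   # bytes: Python 3 bytes (Python 2: plain str)

assert len(text) == 2        # two characters
assert len(raw) == 6         # each of these characters takes three bytes in utf-8
assert raw.decode("utf-8") == text   # decode is the reverse of encode

# Decoding utf-8 bytes as ascii fails, because the bytes are all >= 0x80.
try:
    raw.decode("ascii")
except UnicodeDecodeError as error:
    decode_error = str(error)

# Encoding Chinese characters as ascii fails, because ascii has no such characters.
try:
    text.encode("ascii")
except UnicodeEncodeError as error:
    encode_error = str(error)
```

The two captured messages start with the same "'ascii' codec can't decode byte" and "'ascii' codec can't encode characters" wording as the errors listed earlier.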

So, the full solution needed two things:

  • convert the incoming utf-8 str into a unicode object using decode('utf-8')
  • convert the unicode coming from jinja back into a utf-8 str, ready for printing, using encode('utf-8')

sometext = gettextfromsomewhere().decode('utf-8') # eg by downloading a web-page
env = jinja2.Environment(loader=jinja2.PackageLoader('jinjaapplication', 'templates'))
template = env.get_template('mypage.html')
print template.render( sometext = sometext ).encode('utf-8')

… and this seems to work perfectly.

Dreamhost upgraded to 64-bit, debian 4

Sunday, April 4th, 2010

Dreamhost has upgraded itself to 64-bit servers, running debian 4.

Python is 2.4 by default, and 2.5 is available by running ‘python2.5’ instead of just ‘python’.

I moved my old ‘local’ directory to ‘_local’ and recreated it. For now, I’m just using the built-in php, rather than building my own. I still seem able to log in to my blog using openid so I guess that’s ok. I reinstalled python virtualenv, so I could install sqlalchemy and so on, so that http://manageddreams.com/ailaddergrid is working again now.

Edit: getfacl is still unavailable :-/ Also, debian 4 sounds a little old, considering that 5 has been out for ages now. Anyway… better than debian 3 :-P