This took me a while to figure out…
I wanted to display some Chinese characters in Jinja, in Python.
The problem was I would get errors like:
- UnicodeDecodeError: 'ascii' codec can't decode byte
- UnicodeEncodeError: 'ascii' codec can't encode characters in position
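Both errors can be reproduced without Jinja at all. Here is a minimal sketch, assuming the text is just the two characters 中文 (my own sample, not the page I actually downloaded). The helper error_name is made up for the illustration. The snippet is written so it also runs on Python 3, where Python 2's str is called bytes and unicode is called str:

```python
# Sample data (an assumption for illustration): two Chinese characters
# stored as utf-8 bytes, three bytes per character.
utf8_bytes = b'\xe4\xb8\xad\xe6\x96\x87'
text = utf8_bytes.decode('utf-8')


def error_name(func):
    """Call func() and return the name of the Unicode error it raises, or None."""
    try:
        func()
    except UnicodeError as exc:
        return type(exc).__name__
    return None


# Treating utf-8 bytes as ascii fails on the first byte above 127.
decode_error = error_name(lambda: utf8_bytes.decode('ascii'))

# Squeezing Chinese characters into ascii fails as well.
encode_error = error_name(lambda: text.encode('ascii'))

print(decode_error)  # UnicodeDecodeError
print(encode_error)  # UnicodeEncodeError
```

In both cases the 'ascii' codec is the culprit: ascii only covers values 0 to 127, so it can neither read the utf-8 bytes nor write the Chinese characters.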
I figured it was going to be one of those errors that takes a lot of reading to figure out, but where the actual tweaks to the code are minimal once you know the answer, and that was the case for me.
Ultimately, this page from diveintopython.org turned out to be pretty helpful for me.
First, what was I doing? I had something like the following code:
import jinja2
sometext = gettextfromsomewhere()  # e.g. by downloading a web page
env = jinja2.Environment(loader=jinja2.PackageLoader('jinjaapplication', 'templates'))
template = env.get_template('mypage.html')
print template.render(sometext=sometext)
sometext was the result of using urllib to download a webpage. The webpage contained Chinese characters, encoded as utf-8.
Logically, I felt sometext would also be utf-8 encoded unicode, which is half-right: it was utf-8 encoded, but it was a plain str of bytes, not a unicode object.
I tried all sorts of things like converting sometext to unicode using unicode(), decode, encode and so on, but it still would not work.
Ultimately, I feel there were a couple of key concepts I needed in order to fix the problem:
- print doesn’t seem to be unicode-aware
- decode and encode; str vs unicode
A biggie for me was that print does not seem to be unicode-aware. print seems to be geared up simply for outputting bytes, byte by byte.
print expects not a unicode object, but a plain old ‘str’, containing the unicode characters converted into plain old bytes.
Converting a unicode string into an appropriate str is easy once one knows how, but fairly counter-intuitive I felt. It can be done like this:
print someunicodestring.encode('utf-8')
I felt calling this 'encode' was counter-intuitive, since it seemed to me we were changing away from utf-8. In fact it is the other way around: the unicode object holds characters, not utf-8, and encode turns those characters into bytes coded as utf-8.
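To make the direction of encode concrete, here is a small sketch using a made-up two-character sample, written with escapes so the snippet itself stays plain ascii. It runs on Python 2 and on Python 3, where the unicode type is simply called str:

```python
# Hypothetical sample text: two Chinese characters as a unicode object.
text = u'\u4e2d\u6587'

# encode goes from characters to bytes.
encoded = text.encode('utf-8')

# Two characters become six bytes, because utf-8 uses three bytes
# for each of these characters.
print(len(text))     # 2
print(len(encoded))  # 6
print(repr(encoded))
```

So the result of encode is the thing that is "in utf-8". The unicode object it started from has no encoding of its own.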
This was half the solution, and the biggest part for me.
The other half is figuring out how to deal with the incoming string from the downloaded website. The incoming string is a 'str', and it actually contains the Chinese characters in utf-8, only they are being stored byte by byte as a str, rather than stored character by character as a unicode object.
Converting from the utf-8 str to a unicode object is the reverse of what we did for printing. We can simply do:
sometext = gettextfromsomewhere().decode('utf-8')
The 'decode' function takes the str, which is a byte array holding utf-8 encoded text, and converts it into a unicode object, i.e. a sequence in which each value is a single character. I think. Anyway, it works.
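A quick way to check that picture of decode is to compare lengths before and after. This is only a sketch: raw is a hard-coded stand-in for the downloaded bytes, not the real page. It runs on Python 2 and on Python 3, where Python 2's str is called bytes:

```python
# Stand-in for the downloaded page: two Chinese characters as utf-8 bytes.
raw = b'\xe4\xb8\xad\xe6\x96\x87'

# decode goes from bytes to characters.
text = raw.decode('utf-8')

print(len(raw))   # 6, counted in bytes
print(len(text))  # 2, counted in characters

# The first item of the decoded text is a whole character,
# not the first byte of one.
first_char = text[0]

# Encoding the text again gives back exactly the original bytes.
round_trip = text.encode('utf-8')
```

Here first_char is the single character U+4E2D, and round_trip is equal to raw, which is why decode on the way in and encode on the way out fit together.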
So, the full solution needed two things:
- convert the incoming utf-8 str into unicode using decode('utf-8')
- convert the unicode coming from jinja back into a utf-8 str, ready for printing, using encode('utf-8')
import jinja2
sometext = gettextfromsomewhere().decode('utf-8')  # e.g. by downloading a web page
env = jinja2.Environment(loader=jinja2.PackageLoader('jinjaapplication', 'templates'))
template = env.get_template('mypage.html')
print template.render(sometext=sometext).encode('utf-8')
… and this seems to work perfectly.
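The same shape (decode once on the way in, work with unicode in the middle, encode once on the way out) can be tried without Jinja or a network connection. In this sketch fetch_page and render are hypothetical stand-ins for the download and for template.render(), and an io.BytesIO object plays the role of the output stream. None of this is Jinja's API. It runs on Python 2 and Python 3:

```python
import io


def fetch_page():
    """Hypothetical stand-in for the download: returns utf-8 encoded bytes."""
    return b'\xe4\xb8\xad\xe6\x96\x87'


def render(sometext):
    """Hypothetical stand-in for template.render(): unicode in, unicode out."""
    return u'<p>' + sometext + u'</p>'


# Decode once, on the way in.
sometext = fetch_page().decode('utf-8')

# Work with unicode text in the middle.
page = render(sometext)

# Encode once, on the way out. The byte stream only accepts bytes.
output = io.BytesIO()
output.write(page.encode('utf-8'))

result = output.getvalue()
print(repr(result))
```

Keeping the two conversions at the edges means everything in between only ever sees unicode, so there is no step at which str and unicode get mixed.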