I'm using Python 2.6.5 and when I run the following in the Python shell, I get:

>>> print u'Andr\xc3\xa9'
André
>>> print 'Andr\xc3\xa9'
André
>>>

What's the explanation for the above? Given u'Andr\xc3\xa9', how can I display the above value properly in an HTML page so that it shows André instead of André?

+6  A: 

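'\xc3\xa9' is the UTF-8 encoding of the Unicode character u'\u00e9' (which can also be written as u'\xe9'). Inside a unicode literal, however, \xc3 and \xa9 are two separate code points (Ã and ©), which is why u'Andr\xc3\xa9' prints as André. So use u'Andr\u00e9' or u'Andr\xe9' instead.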

You can convert from one to the other:

>>> 'Andr\xc3\xa9'.decode('utf-8')
u'Andr\xe9'
>>> u'Andr\xe9'.encode('utf-8')
'Andr\xc3\xa9'
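
If you are stuck with the mis-decoded value u'Andr\xc3\xa9' itself, one possible repair is to reverse the bad step. This is only a sketch: it assumes the text was UTF-8 bytes that were wrongly decoded as Latin-1, and `repair_mojibake` is a hypothetical helper name, not something from a library.

```python
def repair_mojibake(text):
    """Undo UTF-8 bytes that were mistakenly decoded as Latin-1.

    Assumption: every character in `text` is below U+0100, so it
    maps back to exactly one byte under Latin-1.
    """
    raw = text.encode('latin-1')   # recover the original bytes
    return raw.decode('utf-8')     # decode them the way they were meant


broken = u'Andr\xc3\xa9'           # the value from the question
fixed = repair_mojibake(broken)    # the intended 5-character string
```

If the input was not produced that way, the encode or decode step will raise a UnicodeError, so it is better to fix the place where the wrong decoding happens if you can.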

Note that print 'Andr\xc3\xa9' gave you the expected result only because your terminal interprets its input as UTF-8: print writes the raw bytes of a byte string, and the terminal decides how to display them. For example, on Windows I get:

>>> print 'Andr\xc3\xa9'
Andr├⌐
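
To see why the same bytes show up differently, you can decode them by hand under a few codecs. This is just an illustration; cp437 is an assumption about which code page a Windows console might be using, and yours may differ.

```python
raw = b'Andr\xc3\xa9'                 # the UTF-8 bytes for the name

as_utf8 = raw.decode('utf-8')         # what a UTF-8 terminal shows
as_cp437 = raw.decode('cp437')        # what a cp437 console would show
as_latin1 = raw.decode('latin-1')     # one character per byte

# Each byte 0xC3 and 0xA9 becomes its own character under the
# single-byte codecs, so the string grows from 5 to 6 characters.
lengths = (len(as_utf8), len(as_cp437), len(as_latin1))
```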

As for outputting HTML, it depends on which web framework you use and what encoding you output in the HTML page. Some frameworks (e.g. Django) will convert unicode values to the correct encoding automatically, while others will require you to do so manually.
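
If your framework does not do it for you, a rough sketch of the manual route looks like this. `render_page` is a made-up function for illustration, and the markup is a bare minimum; the point is only that the bytes you send and the charset you declare have to agree.

```python
def render_page(name):
    """Build a tiny HTML page and return it as UTF-8 encoded bytes.

    `name` must be a unicode string. HTML escaping of `name` is
    omitted here to keep the sketch short.
    """
    html = (u'<html><head><meta charset="utf-8"></head>'
            u'<body><p>%s</p></body></html>') % name
    return html.encode('utf-8')   # encode once, at the output boundary


page = render_page(u'Andr\xe9')
```

Keeping everything as unicode internally and encoding only when writing the response avoids the double-encoding that produces output like André.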

interjay
I'm currently using Django and the output displayed in the template is André. Do you know what I should do to make the template display André?
Thierry Lam
@Thierry Lam: Django assumes that all non-unicode strings are UTF-8. So in this case you can either use `'Andr\xc3\xa9'` (UTF-8 encoded string) or `u'Andr\xe9'` (unicode string).
interjay
+1  A: 

Try this:

>>> unicode('Andr\xc3\xa9', 'utf-8')
u'Andr\xe9'
>>> print u'Andr\xe9'
André

That may answer your question.

EDIT: or see the above answer

darelf
A: 

I am not sure, but I would guess that different codecs are applied by the print operation. Probably some utf-8 vs. unicode issue.

For HTML, you would need to encode certain characters using the HTML syntax for unicode. I think that the Python codecs module might be able to help you.
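
For what it's worth, one way to get that HTML syntax without extra modules is the 'xmlcharrefreplace' error handler of the built-in encode method. A small sketch, assuming you want pure-ASCII output that works regardless of the page's declared charset:

```python
name = u'Andr\xe9'

# Characters that ASCII cannot represent are replaced by numeric
# character references such as &#233; instead of raising an error.
ascii_html = name.encode('ascii', 'xmlcharrefreplace')
```

This only handles non-ASCII characters; characters like < and & still need normal HTML escaping.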

Uri