Hiya,

I am adding UTF-8 data to a database in Django.

As the data goes into the database, everything looks fine: the characters (for example, “Hello”) are UTF-8 encoded.

My MySQL database is UTF-8 encoded. When I examine the data from the DB by doing a select, my example string looks like this: ?Hello?. I assume this is showing the characters as UTF-8 encoded.

However, when I select the data from the database in the terminal or for export as a web service, my string looks like this: \u201cHello World\u201d.

Does anyone know how I can display my characters correctly?

Do I need to perform some additional UTF-8 encoding somewhere?

Thanks, Nick.

+4  A: 
u'\u201cHello World\u201d'

is the correct Python representation of the Unicode text “Hello World”. The smart-quote characters are being displayed using a \uXXXX hex escape rather than verbatim because there are often problems with writing Unicode characters to the terminal, particularly on Windows. (It looks like MySQL tried to write them to the terminal but failed, resulting in the ? placeholders.)

On a terminal that does manage to correctly input and output Unicode characters, you can confirm that they're the same thing:

Python 2.6.5 (r265:79063, Apr 16 2010, 13:57:41) [GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> u'\u201cHello World\u201d'==u'“Hello World”'
True

In the same way, for byte strings, \x sequences are just the same as the characters they stand for:

>>> '\x61'=='a'
True

Now if you've got \u or \x sequences escaping Python and making their way into an exported file, then you've done something wrong with the export. Perhaps you used repr() somewhere by mistake.
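To make the difference between a string and its escaped display concrete, here is a minimal sketch. The variable names are only illustrative, and `unicode_escape` is used as a stand-in for the kind of backslash-escaped text that Python 2's repr() shows; it is not necessarily what happened in the original export.

```python
# -*- coding: utf-8 -*-
# Illustrative only: the variable names below are not from the original post.
text = u'\u201cHello World\u201d'

# The escaped spelling and the literal spelling are the same 13-character string.
same = (text == u'“Hello World”')

# Encoding to UTF-8 gives the bytes that should end up in a file or response.
utf8_bytes = text.encode('utf-8')

# A backslash-escaped rendering of the text. If this form shows up in
# exported data, something escaped the string instead of encoding it.
escaped = text.encode('unicode_escape').decode('ascii')

print(same)
print(repr(utf8_bytes))
print(escaped)
```

The first two values are what you want to work with; the third is only a display form, and writing it out in place of the UTF-8 bytes is what leaves literal \u sequences in the output.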

bobince
Yes - you're spot on. Thank you for the detailed explanation too! I needed to add 'ensure_ascii=False' in my export: 'HttpResponse(simplejson.dumps(final, ensure_ascii=False));'
Nick Cartwright
Ah, it was a JSON response? In that case it would still be fine: `\u` escapes are just as valid in JavaScript string literals as they are in Python. `ensure_ascii=False` gives you slightly smaller JSON output, but be careful as it won't encode the U+2028 and U+2029 characters, which act as line separators in JavaScript. They're allowed unescaped in string literals in JSON, but if you `eval()` them from JavaScript (a common way to evaluate JSON on older browsers that don't have the native `JSON` object) you'll get a syntax error.
bobince
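A rough sketch of the point in the comment above, using the standard-library `json` module in place of `simplejson` (an assumption; `dumps` takes the same `ensure_ascii` argument in both). The `final` payload and the `dumps_js_safe` helper are hypothetical names made up for illustration.

```python
# -*- coding: utf-8 -*-
import json

# Hypothetical payload standing in for the `final` object in the comment.
final = {'greeting': u'\u201cHello World\u201d'}

# Default (ensure_ascii=True): non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(final)

# ensure_ascii=False: the characters are written out verbatim.
raw = json.dumps(final, ensure_ascii=False)

# Either form parses back to the same data.
round_trip_equal = (json.loads(escaped) == json.loads(raw) == final)


def dumps_js_safe(obj):
    """Hypothetical helper: verbatim output, but with U+2028 and U+2029
    re-escaped so the result can also be eval()'d as JavaScript."""
    out = json.dumps(obj, ensure_ascii=False)
    return out.replace(u'\u2028', u'\\u2028').replace(u'\u2029', u'\\u2029')


print(escaped)
print(round_trip_equal)
print(json.loads(dumps_js_safe({'s': u'a\u2028b'})) == {'s': u'a\u2028b'})
```

Both outputs are valid JSON for the same data, so the choice is mostly about size and readability. If the output may be passed to `eval()` in a browser, the default escaping (or a helper along the lines of `dumps_js_safe`) sidesteps the line-separator problem.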