views:

200

answers:

1

I have the following word that I fetch via a web service: André

From Python, the value looks like: "Andr\u00c3\u00a9". The input is then decoded using json.loads:

>>> import json
>>> json.loads('{"name":"Andr\\u00c3\\u00a9"}')
>>> {u'name': u'Andr\xc3\xa9'}

When I store the above in a utf8 MySQL database, the data is stored like the following using Django:

SomeObject.objects.create(name=u'Andr\xc3\xa9')

Querying the name column from a mysql shell or displaying it in a web page gives: André

The web page displays in utf8:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

My database is configured in utf8:

mysql> SHOW VARIABLES LIKE 'collation%';
+----------------------+-----------------+
| Variable_name        | Value           |
+----------------------+-----------------+
| collation_connection | utf8_general_ci | 
| collation_database   | utf8_unicode_ci | 
| collation_server     | utf8_unicode_ci | 
+----------------------+-----------------+
3 rows in set (0.00 sec)

mysql> SHOW VARIABLES LIKE 'character_set%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       | 
| character_set_connection | utf8                       | 
| character_set_database   | utf8                       | 
| character_set_filesystem | binary                     | 
| character_set_results    | utf8                       | 
| character_set_server     | utf8                       | 
| character_set_system     | utf8                       | 
| character_sets_dir       | /usr/share/mysql/charsets/ | 
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

How can I retrieve the word André from a web service, store it properly in a database with no data loss and display it on a web page in its original form?

+4  A: 

The fault is already in the string you pass to json.loads(). \u00c3 is "A tilde" and \00a9 is the copyright sign. Correct for é would be \u00e9.

Probably the string has been encoded in UTF-8 by the sender and decoded as ISO-8859-1 by the receiver.

For example, if you run the following Python script:

# -*- encoding: utf-8 -*-

import json

data = {'name': u'André'}
print('data: {0}'.format(repr(data)))

code = json.dumps(data)
print('code: {0}'.format(repr(code)))

conv = json.loads(code)
print('conv: {0}'.format(repr(conv)))

name = conv['name']
print(u'Name is {0}'.format(name))

The output should look like:

data: {'name': u'Andr\xe9'}
code: '{"name": "Andr\\u00e9"}'
conv: {u'name': u'Andr\xe9'}
Name is André

Managing unicode in Python 2.x can sometimes become a nuisance. Unfortunately, Django does not yet support Python 3.

Bernd Petersohn
+1 but I don't see how Python 3.x stops people doing `rawbytes.decode('latin1')` instead of `rawbytes.decode('utf8')`. Data is NOT stored and transmitted in Unicode; it is still encoded into bytes, and the receiver still needs to know *which* encoding to decode it with.
John Machin
What do you use to check the actual value of \u00c3, \00a9 or \u00e9?
Thierry Lam
@John Machin: You are right. However, in most cases, the receiver should be able to derive the encoding. For example, if a web server sends out an html page with a form in a particular encoding, practically all browsers will send back the form data in the same encoding.@Thierry Lam: If you can't find a character map viewer under some accessoiries menu on your system, try google for "unicode codepoints". Example site: http://inamidst.com/stuff/unidata/
Bernd Petersohn