Hi everyone,

I've always found character sets and encodings complicated to understand and here I'm faced with another problem. My apologies for any inaccuracies. I'll do my best.

I'm requesting data from a server which returns JSON. In the HTTP headers it also returns the character set like so:

Content-Type: text/html; charset=UTF-8

I'm using Python's json library to parse the response with the json.loads method. When I pass it the returned JSON, it gives me a dictionary whose strings are unicode objects. I've Googled around and I understand that JSON decodes to Unicode because JavaScript strings are Unicode. How can I get the strings as UTF-8 instead? I would like to use the same encoding as specified in the response header.

I've read this post but it didn't help.

Thank you.

+2  A: 

json.loads automatically handles strs that are passed to it in UTF-8, so in this specific case you shouldn't have to worry about charsets yourself. loads is already decoding the UTF-8 bytes into Python's internal Unicode representation (UCS-2 or UCS-4, depending on how Python was built) for you.

Unless you have some other reason why you really need to operate on the original UTF-8, you should be fine, even though you're passing in a str and getting back unicode objects. You can also specify the input encoding as the second parameter to loads if you want to be sure or if you're dealing with varying charsets.
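As a minimal sketch of the point above: decode the bytes with the charset from the response header, parse, and only encode back to UTF-8 where bytes are actually needed. The body and the charset value here are made-up stand-ins for a real response, and the code is written so that it should run on both Python 2 and Python 3.

```python
import json

# Hypothetical response body: the UTF-8 bytes a server might send.
raw_body = u'{"city": "Z\u00fcrich"}'.encode("utf-8")

# Charset as it would be read from the Content-Type header (assumed here).
charset = "UTF-8"

# Decode the bytes to Unicode text first, then parse the text.
data = json.loads(raw_body.decode(charset))

# The parsed strings are Unicode. Encode only at the boundary where
# bytes are required (writing to a file, a socket, and so on).
city_utf8 = data["city"].encode("utf-8")
```

Keeping the values as Unicode inside the program and encoding at the edges avoids mixing byte strings and text.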

zerocrates
+1  A: 

From the application/json RFC (RFC 4627):

JSON text SHALL be encoded in Unicode. The default encoding is UTF-8.

Since the first two characters of a JSON text will always be ASCII characters [RFC0020], it is possible to determine whether an octet stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking at the pattern of nulls in the first four octets.

       00 00 00 xx  UTF-32BE
       00 xx 00 xx  UTF-16BE
       xx 00 00 00  UTF-32LE
       xx 00 xx 00  UTF-16LE
       xx xx xx xx  UTF-8

So, given JSON text as a bytestring, it is always possible to convert it to a Unicode string. Given a Unicode string, you can then convert it, if desired, to another bytestring using any encoding you like.

json.loads() uses the specified encoding (the default is 'utf-8'). If the input encoding is not ASCII-based (UTF-16 or UTF-32, for example), then the text should be manually converted to Unicode before passing it to json.loads().
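The null-pattern table quoted above can be turned into a small helper. This is only a sketch: detect_json_encoding and loads_any are hypothetical names, not part of the json module, and the code assumes the input has no byte order mark and starts with two ASCII characters as the RFC describes.

```python
import json

def detect_json_encoding(raw):
    """Guess the encoding of JSON bytes from the nulls in the first four octets."""
    head = bytearray(raw[:4])  # bytearray yields ints on Python 2 and 3
    if len(head) < 4:
        # Too short to match a UTF-16/UTF-32 pattern; assume the default.
        return "utf-8"
    nulls = [octet == 0 for octet in head]
    if nulls == [True, True, True, False]:
        return "utf-32-be"   # 00 00 00 xx
    if nulls == [True, False, True, False]:
        return "utf-16-be"   # 00 xx 00 xx
    if nulls == [False, True, True, True]:
        return "utf-32-le"   # xx 00 00 00
    if nulls == [False, True, False, True]:
        return "utf-16-le"   # xx 00 xx 00
    return "utf-8"           # xx xx xx xx

def loads_any(raw):
    """Decode JSON bytes to Unicode manually, then parse the text."""
    return json.loads(raw.decode(detect_json_encoding(raw)))

# Usage: the same document encoded as UTF-16LE is detected and parsed.
sample = u'{"k": "v\u00e9"}'.encode("utf-16-le")
parsed = loads_any(sample)
```

Decoding before calling json.loads() keeps the parser out of the encoding question entirely, which is the manual conversion the paragraph above recommends for encodings that are not ASCII-based.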

J.F. Sebastian