On Windows, I have the following problem:

>>> string = "Don´t Forget To Breathe"
>>> import json,os,codecs
>>> f = codecs.open("C:\\temp.txt","w","UTF-8")
>>> json.dump(string,f)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python26\lib\json\__init__.py", line 180, in dump
    for chunk in iterable:
  File "C:\Python26\lib\json\encoder.py", line 294, in _iterencode
    yield encoder(o)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 3-5: invalid data

(Notice the non-ASCII apostrophe in the string.)

However, my friend, on his mac (also using python2.6), can run through this like a breeze:

> string = "Don´t Forget To Breathe"
> import json,os,codecs
> f = codecs.open("/tmp/temp.txt","w","UTF-8")
> json.dump(string,f)
> f.close(); open('/tmp/temp.txt').read()
'"Don\\u00b4t Forget To Breathe"'

Why is this? I've also tried using UTF-16 and UTF-32 with json and codecs, but to no avail.

+1  A: 

What does repr(string) show on each machine? On my Mac the apostrophe shows as \xc2\xb4 (UTF-8 encoding, 2 bytes), so of course the utf8 codec can deal with it; on your Windows machine it clearly isn't encoded that way, since the error complains about three bytes — so on Windows you must have some other, non-UTF-8 encoding set for your console.
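For illustration, here is the byte-level difference in question. This sketch spells the accent with an explicit escape (u"\u00b4"), so it is independent of any console encoding, and cp1252 is just an assumed stand-in for a typical Windows single-byte codec:

```python
# The acute accent U+00B4 is the non-ASCII character in the string.
s = u"Don\u00b4t Forget To Breathe"

# In UTF-8 the accent becomes TWO bytes, \xc2\xb4:
utf8_bytes = s.encode("utf-8")
print(repr(utf8_bytes))

# In a legacy single-byte codec such as cp1252 it is ONE byte, \xb4 --
# which a UTF-8 decoder will then reject as invalid data:
cp1252_bytes = s.encode("cp1252")
print(repr(cp1252_bytes))
```

This is why the same source line behaves differently on the two machines: the interactive session stores whatever bytes the console encoding produces, not the characters you typed.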

Your general problem is that, in Python pre-3, you should not enter a byte string ("...." as you used, rather than u"....") with non-ASCII content, unless you spell that content as escape sequences. Depending on how the session is set up, this may fail outright, or it may produce bytes — encoded with whatever codec the session uses as its default — that are not the exact bytes you expect, because you don't know which default codec is in use. Use an explicit Unicode literal

string = u"Don´t Forget To Breathe"

and you should be OK (or, if you do have a problem, it will emerge right at the time of this assignment, at which point we can go into "how do I set a default encoding for my interactive sessions?" if that's what you need).
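A minimal end-to-end sketch of the fix, using a temporary file rather than the hard-coded paths above, and the escape \u00b4 in place of the typed non-ASCII apostrophe:

```python
import codecs
import json
import os
import tempfile

s = u"Don\u00b4t Forget To Breathe"  # explicit Unicode literal

path = os.path.join(tempfile.gettempdir(), "temp.txt")
f = codecs.open(path, "w", "utf-8")
json.dump(s, f)  # json escapes non-ASCII by default (ensure_ascii=True)
f.close()

print(open(path).read())
```

Because json escapes non-ASCII characters by default, the file contains only ASCII ("Don\u00b4t Forget To Breathe", in quotes), exactly as in the Mac transcript above.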

Alex Martelli
Hm. Interestingly, that works. However, in the actual implementation I'm working with (the code above was just an isolated example), what I'm really trying to serialize with json is an object created by this function: http://pastebin.com/e0CNAvCE (it traverses a directory, finds all MP3s, and constructs a dictionary from their metadata). Naturally there'll be some special characters in there, but I thought I had already dealt with that possibility by wrapping a unicode() around the metadata. Is my unicode() approach in that function somehow different from the u"" example?
ventolin
@ventolin, no: the `unicode` calls themselves should immediately fail any time their argument contains non-Ascii characters (since you're not specifying an encoding, `'ascii'` should be getting used there). I'm unable to guess how they could succeed and yet the json serialization fail, esp. with the **decode** error you report (maybe the metadata's ascii but the _file paths_ themselves are not...?)
Alex Martelli
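The failure mode Alex describes can be reproduced directly. In Python 2, unicode(bytestring) with no encoding argument is equivalent to bytestring.decode('ascii'); this sketch spells it with .decode so it also runs on Python 3:

```python
# A byte string containing the UTF-8 encoding of the acute accent.
raw = b"Don\xc2\xb4t Forget To Breathe"

# Decoding with an explicit, correct encoding works...
print(raw.decode("utf-8"))

# ...but the implicit default ('ascii') fails immediately -- which is
# why unicode(metadata) should have raised at the source, not later
# inside json.dump.
try:
    raw.decode("ascii")
except UnicodeDecodeError as e:
    print("ascii decode failed:", e)
```

So if the serialization only blows up inside json.dump, the non-ASCII bytes most likely reached it through some path that was never run through unicode() at all, as Alex suggests for the file paths.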