views: 123
answers: 4

Does anyone know why the string conversion functions throw exceptions when errors="ignore" is passed? How can I convert from regular Python string objects to unicode without errors being thrown? Thanks very much!

python -c "import codecs; codecs.open('tmp', 'wb', encoding='utf8', errors='ignore').write('кошка')"

returns
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python2.6/codecs.py", line 686, in write
    return self.writer.write(data)
  File "/usr/lib/python2.6/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)

EDIT -- thanks for the responses, but does anyone know how to convert the literal above without using the "u" prefix? The reason being that you could, of course, be dealing with something that isn't a constant :)

A: 

The write method (in Python 2) takes a unicode object, and you're passing it a str -- so the encode call in codecs.py line 351 is first trying to build a unicode object (with the default codec, 'ascii'). Fix is easy: change the write call to

write(u'кошка')

The u prefix tells Python you're using a Unicode object, and it should be fine.
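
As a minimal sketch of the same fix in script form (assuming the source file is saved as UTF-8, which the coding declaration announces to Python 2):

# -*- coding: utf-8 -*-
import codecs

# The codecs writer expects a unicode object; the u prefix makes the
# literal unicode, so no implicit ascii decode is attempted.
codecs.open('tmp', 'wb', encoding='utf8', errors='ignore').write(u'кошка')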

Alex Martelli
+1  A: 

problem is here ===>>>> write('кошка')

You are writing a str object, the recipient is expecting a unicode object, so it tries to convert it to unicode using the default encoding (ascii), which of course (?) produces the well-known (?) UnicodeDecodeError: 'ascii' codec can't decode byte 0xXX in position 0: ordinal not in range(128)

The whole point of using the codecs module like that is to get it to encode your unicode objects to UTF-8 on the fly -- so feed it unicode objects.

Update -- how to convert the literal (or any non-literal):

unicode_object = literal_or_whatever.decode("UNKNOWN_ENCODING")
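
For example, if the bytes happen to be UTF-8 (an assumption -- substitute whatever encoding your data really arrives in), a sketch:

# -*- coding: utf-8 -*-
import codecs

raw = 'кошка'               # a plain str (bytes); stands in for non-constant data
text = raw.decode('utf-8')  # decode the bytes to a unicode object first
codecs.open('tmp', 'wb', encoding='utf8', errors='ignore').write(text)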

Do you know how your literal is encoded? Would you like to tell us what you are trying to accomplish? A one-liner with python -c isn't much help ;-)

John Machin
The unicode() function doesn't work; it throws the same exception.
gatoatigrado
@gatoatigrado: I said to feed it unicode; I didn't say to use the `unicode()` function. If you use `unicode()` without specifying an encoding, OF COURSE it will get the same exception (default encoding is ascii). And please read the last sentence of my answer.
John Machin
Ah, sorry, I was just scanning. The question was "how" do I feed it unicode from a Python string. The logical way would be `unicode(pystr)`, but that doesn't work. I think I've used dumb tricks with bytes() before, but I'd like to know what the real solution is.
gatoatigrado
@gatoatigrado: unicode(str_object) as already explained is NOT the "logical" way (whatever that means) -- you need to know what encoding (e.g. "cp1252") your str_object is in, and then do `str_object.decode(that_encoding)` or `unicode(str_object, that_encoding)` (these are equivalent)
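
To make that equivalence concrete (a minimal sketch, assuming the bytes are UTF-8):

# -*- coding: utf-8 -*-
s = 'кошка'  # str (bytes)
assert s.decode('utf-8') == unicode(s, 'utf-8')  # both produce the same unicode object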
John Machin
+1  A: 

A non-solution (from the question author) I just found out about: use Python 3.

python3 -c "import codecs; codecs.open('tmp', 'wb', encoding='utf8', errors='ignore').write('кошка')"
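
In Python 3 the codecs dance isn't needed at all, since str is already unicode and the built-in open encodes on write (a sketch of the idiomatic equivalent):

# Python 3: str literals are already unicode; open() encodes on the way out.
open('tmp', 'w', encoding='utf8', errors='ignore').write('кошка')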
gatoatigrado
Yes, Python 3 has native unicode support. =]
Xavier Ho
+1  A: 

In Python 2.x use write('кошка'.decode('utf-8')) instead of write('кошка').

You can use another encoding instead of 'utf-8' too, as long as it matches the actual encoding of your bytes.

Hopefully it will not throw any errors ...
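
Applied to the original one-liner (assuming your terminal is UTF-8, so the bytes in the literal really are UTF-8):

python -c "import codecs; codecs.open('tmp', 'wb', encoding='utf8', errors='ignore').write('кошка'.decode('utf-8'))"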

aberry