Been banging my head on this for a while and I've read a bunch of articles and the issue isn't any clearer. I have a bunch of strings stored in my database, imagine the following:

x = '\xd0\xa4'
y = '\x92'

At the Python shell I get the following:

print x
Ф
print y
?

Which is exactly what I want to see. However then there is the following:

print unicode(x, 'utf8')
Ф

But not this:

unicode(y, 'utf8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 0: unexpected code byte

My feeling is that our strings are getting mangled because Django tries to convert them to unicode, but I'm just guessing at this point. Any insights or workarounds appreciated.

UPDATE: When I look at the database at the row that contains the '\x92' value, I see this character as ’. An apostrophe. I'm viewing the contents of the database using a Unicode UTF-8 encoding.

+4  A: 

I thought any unicode character other than the ASCII subset had a multi-byte representation in UTF-8. Your y makes sense as a single-byte-per-char string, but not as a UTF-8 string. Because the single byte is outside the 0x00 to 0x7F ASCII range, the codec will expect an extra byte or more for the conversion to a "real" unicode character.
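A quick way to check that claim, sketched in Python 3 syntax (where `str.encode` and `bytes.decode` take the place of the Python 2 `unicode()` calls used in the question). The sample characters are arbitrary picks for illustration:

```python
# Byte length of a few characters once encoded as UTF-8.
# Only code points below U+0080 (the ASCII range) fit in a single byte.
samples = [
    "A",        # LATIN CAPITAL LETTER A
    "\u0424",   # CYRILLIC CAPITAL LETTER EF
    "\u2019",   # RIGHT SINGLE QUOTATION MARK
]

utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in utf8_lengths.items():
    print("U+%04X -> %d byte(s)" % (ord(ch), n))
```

So a lone byte such as 0x92 cannot be a complete non-ASCII character in UTF-8.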

I'm not as familiar with Python as I once was, though, and I'm not confident about this answer.

EDIT: hop's is the better answer IMO.

Steve314
+5  A: 

\x92 is not a valid utf-8 encoded character.

You don't notice that because you use simple (non-unicode) strings for x and y until you try to decode them into unicode strings. When you then print them, they are simply dumped to the terminal "as is" and the terminal itself interprets the bytes according to its encoding setting.

There is a third parameter to unicode() that tells python what to do in case of encoding (decoding) errors:

>>> unicode('\x92', 'utf8', 'replace')
u'\ufffd'
>>> print _
�
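For comparison, here is a rough sketch of the three standard error handlers side by side, written in Python 3 syntax (so `bytes.decode` stands in for `unicode()`); the variable names are only for illustration:

```python
raw = b"\x92"  # the problematic byte from the question

# 'strict' is the default handler: invalid input raises UnicodeDecodeError.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 'replace' substitutes U+FFFD REPLACEMENT CHARACTER for the bad byte.
replaced = raw.decode("utf-8", "replace")

# 'ignore' silently drops the bad byte, which loses data.
ignored = raw.decode("utf-8", "ignore")

print(strict_failed, ascii(replaced), ascii(ignored))
```

Neither `replace` nor `ignore` recovers the original character; they only stop the exception.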
hop
@hop: "You don't notice that because you use simple (non-unicode) strings for x and y until you try to decode them into unicode strings." -- so you're saying that the simple non-unicode string "\xd0\xa4" has been magically transmogrified into the unicode character U+0424 CYRILLIC CAPITAL LETTER EF without any decoding happening??
John Machin
@John: no, i don't say that at all. there is nothing magic about the terminal decoding a valid utf-8 sequence into a unicode character to display. it's just not python that does any decoding.
hop
@John: The terminal decodes that "\xd0\xa4" to the U+0424 because your terminal is configured for UTF-8, which is typically the default nowadays. If it was set to something else, this would not work.
Thanatos
@hop: The essence of the problem is that *in this case* "the terminal" decodes the byte string in a fashion inconsistent with `unicode(y, 'utf8')`.
John Machin
@Thanatos: I'm well aware that utf8 is typically the default (for *x terminals). My point was that @hop's original text appeared to be saying that the terminal wasn't doing any decoding at all.
John Machin
@John: no that is not the essence of the problem _at all_.
hop
@hop: I say: terminal is implicitly using 'replace', OP's code is implicitly using 'strict', OP's expectation is based on terminal's behaviour. So what is the essence of the problem according to you?
John Machin
@John: wrong expectations based on ignorance regarding encodings. i also strongly suspect that the OP originally had a very different problem, since django rarely involves terminals.
hop
+3  A: 

Looks like you have a typo; should be x = '\xd0\xa4'. It helps very much if you use copy paste of what you actually ran and what appeared on the output.

"\x92" is not a valid UTF-8 string. This explains the exception that you got.

More of a puzzle is why print y produced ?. What are you calling "the Python console"?? It appears to be operating in "replace" mode and substituting "?" ... are you sure that it's a plain "?" and not a white "?" inside a black diamond? Why do you say that "?" is exactly what you expect to see?

UPDATE: You now say """When I look at the database at the row that contains the '\x92' value, I see this character as ’. An apostrophe. I'm viewing the contents of the database using a Unicode UTF-8 encoding."""

That's not an apostrophe. It seems that that piece of data has been encoded using one of the cp125X (aka windows-125X) encodings. Illustrating using cp1252 (the usual suspect):

IDLE 2.6.4      
>>> import unicodedata
>>> uc = '\x92'.decode('cp1252')
>>> print repr(uc)
u'\u2019'
>>> print uc
’
>>> unicodedata.name(uc)
'RIGHT SINGLE QUOTATION MARK'
>>> 
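If the column really does hold a mix of UTF-8 and cp1252 bytes (an assumption; nothing in the question confirms it), one possible workaround is to try UTF-8 first and fall back to cp1252. A minimal sketch in Python 3 syntax, where `decode_mixed` is a hypothetical helper name and not part of any library:

```python
def decode_mixed(raw, fallback="cp1252"):
    """Decode bytes as UTF-8, falling back to a single-byte encoding.

    Hypothetical helper: the fallback encoding is a guess about the data,
    not something the bytes themselves can confirm.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # cp1252 leaves a few byte values undefined, so use 'replace'
        # rather than letting the fallback raise as well.
        return raw.decode(fallback, "replace")


print(ascii(decode_mixed(b"\xd0\xa4")))  # valid UTF-8, decoded as such
print(ascii(decode_mixed(b"\x92")))      # invalid UTF-8, decoded as cp1252
```

This only papers over the symptom; fixing the encoding at the point where the data enters the database is the more reliable route.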

Instead of "viewing the contents of the database using a Unicode UTF-8 encoding" (whatever that means), try writing a small snippet of Python code to extract the offending string and then do print repr(bad_string). Show us the code that you ran, plus the output of the repr(). Also tell us which version of Python, what platform (Windows or unix-based), and what version of what database software. And the part of the CREATE TABLE statement relevant to the column in question.

Also please read this and this.

John Machin
didn't i tell you this would happen? :)
hop
@hop: No, you said you suspected that there was a different underlying problem. And that was like saying that you suspected that the sun rises in the east -- an OP rarely asks the question they should have asked.
John Machin
+1  A: 
0x92 (hex) = 10 010010 (binary)

In UTF-8, a byte whose leading bits are 10 is a continuation byte, which can never be the first byte of a character. The payload 010010 fits in one byte, so its "header" would have to be 0 (--> 00010010) instead of 10, and characters may not be represented with more bytes than needed. Either way, "\x92" is not a valid UTF-8 encoded string.
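The "header" bits can be inspected directly. The following is a simplified sketch (Python 3 syntax) that classifies a byte purely by its leading bits; `utf8_byte_role` is a hypothetical name, and the function deliberately ignores the finer rules of the standard, such as lead bytes that are never valid:

```python
def utf8_byte_role(byte):
    """Classify one byte value (0-255) by its leading bits in the UTF-8 layout."""
    if byte & 0b10000000 == 0b00000000:
        return "single-byte (ASCII)"
    if byte & 0b11000000 == 0b10000000:
        return "continuation"
    if byte & 0b11100000 == 0b11000000:
        return "lead of 2-byte sequence"
    if byte & 0b11110000 == 0b11100000:
        return "lead of 3-byte sequence"
    if byte & 0b11111000 == 0b11110000:
        return "lead of 4-byte sequence"
    return "invalid"


for b in (0x92, 0xD0, 0xA4):
    print("0x%02X = %s -> %s" % (b, format(b, "08b"), utf8_byte_role(b)))
```

Under this classification 0x92 is a continuation byte with no lead byte before it, whereas 0xD0 followed by 0xA4 forms a complete two-byte sequence.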

I guess your database uses some one-byte-per-character encoding (such as latin-1). If you're coding the database queries yourself, you must ensure that the connection encoding is correct or that strings are decoded correctly. With Django models, everything should work automatically.

AndiDog
A: 

I see now where you're confused. Let's look at this:

x = '\xd0\xa4'
y = '\x92'

If I print x, I get Ф. This is because my terminal is using UTF-8 as its character encoding. Thus, when it gets D0 A4, it attempts to decode it as UTF-8, and gets a "Ф". If I change my terminal to use, say, ISO-8859-1 ("latin1"), and I say print x, my terminal will attempt to decode D0 A4 using ISO-8859-1, and since D0 A4 is also a valid ISO-8859-1 string, it will decode, but this time, to "Ð¤".
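What the terminal does with those two bytes can be imitated in Python itself. A small sketch in Python 3 syntax, decoding the same bytes under both encodings and naming the resulting characters:

```python
import unicodedata

raw = b"\xd0\xa4"  # the bytes behind x in the question

as_utf8 = raw.decode("utf-8")      # what a UTF-8 terminal would show
as_latin1 = raw.decode("latin-1")  # what an ISO-8859-1 terminal would show

utf8_names = [unicodedata.name(ch) for ch in as_utf8]
latin1_names = [unicodedata.name(ch) for ch in as_latin1]

print(utf8_names)
print(latin1_names)
```

The same two bytes come out as one Cyrillic letter under UTF-8 and as two unrelated characters under ISO-8859-1; the bytes never change, only the interpretation does.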

Now, for print y. This isn't a UTF-8 string, so my terminal can't decode this. It shows me this error, in my case, by printing "�". I'm wondering if you see "�" or "?" - you should probably see the former, but it depends on what your terminal does in the face of bad output.

Your terminal's encoding should match whatever $LANG says, and your program should output data in whatever encoding $LANG specifies. Nowadays, $LANG is typically ???.UTF-8, where the ??? varies. (Mine is en_US.UTF-8)
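To see what Python thinks the environment's encoding is, something along these lines may help (Python 3 syntax; `describe_encodings` is a hypothetical helper, and the values it reports depend entirely on the machine it runs on):

```python
import locale
import os
import sys


def describe_encodings():
    """Collect the encoding-related settings Python can see."""
    return {
        # Raw environment variable; None if it is not set.
        "LANG": os.environ.get("LANG"),
        # The encoding the locale machinery prefers for text data.
        "preferred": locale.getpreferredencoding(False),
        # The encoding Python uses when printing; may be None if
        # stdout has been replaced by an object without that attribute.
        "stdout": getattr(sys.stdout, "encoding", None),
    }


for key, value in describe_encodings().items():
    print("%s: %r" % (key, value))
```

If these disagree with what the terminal emulator is set to, printed non-ASCII text will likely come out garbled.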

Now, when you say unicode(y, 'utf8'), Python attempts to decode this as UTF-8, and appropriately throws an exception.

I'm using Gnome Terminal, and can change my character encoding by going to Terminal → Set Character Encoding

Thanatos