ansaurus

Question

Answer 1

+4 A:

As I understand it, any Unicode character outside the ASCII subset has a multi-byte representation in UTF-8. Your y makes sense as a single-byte-per-character string, but not as a UTF-8 string. Because the single byte is outside the 0x00 to 0x7F ASCII range, the codec expects it to be part of a multi-byte sequence, so a lone byte like this cannot be converted to a "real" Unicode character.
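
A quick way to check the multi-byte claim is sketched below. It is written in Python 3 syntax (the thread itself uses Python 2), and the sample characters are arbitrary choices for illustration, apart from Ф and the \x92 byte from the question.

```python
# Sketch (Python 3): how many bytes UTF-8 needs for a few sample characters.
samples = ["A", "\u0424", "\u2019"]  # ASCII letter, CYRILLIC EF, right single quote

utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
for ch, n in utf8_lengths.items():
    print(ascii(ch), n)

# A lone byte outside 0x00-0x7F is not a complete UTF-8 sequence.
try:
    b"\x92".decode("utf-8")
    lone_byte_ok = True
except UnicodeDecodeError:
    lone_byte_ok = False
print(lone_byte_ok)
```

Only the ASCII character fits in one byte; the other two need two and three bytes respectively.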

I'm not as familiar with Python as I once was, though, and I'm not confident about this answer.

EDIT: hop's is the better answer IMO.

Steve314 2010-07-10 22:07:49

Answer 2

+5 A:

\x92 is not a valid utf-8 encoded character.

You don't notice that because you use simple (non-unicode) strings for x and y until you try to decode them into unicode strings. When you then print them, they are simply dumped to the terminal "as is" and the terminal itself interprets the bytes according to its encoding setting.

There is a third parameter to unicode() that tells python what to do in case of encoding (decoding) errors:

>>> unicode('\x92', 'utf8', 'replace')
u'\ufffd'
>>> print _
�
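
For comparison, here is a rough Python 3 equivalent of the call above, where the third parameter becomes the errors argument of bytes.decode(). It is only a sketch; the variable names are made up for illustration.

```python
# Sketch (Python 3): the same invalid byte under three standard error handlers.
raw = b"\x92"

replaced = raw.decode("utf-8", errors="replace")  # U+FFFD REPLACEMENT CHARACTER
ignored = raw.decode("utf-8", errors="ignore")    # the invalid byte is dropped

try:
    raw.decode("utf-8")  # errors="strict" is the default
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(ascii(replaced), ascii(ignored), strict_failed)
```

'strict' is what the OP's code uses implicitly, which is why it raises instead of printing a placeholder.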

hop 2010-07-10 22:28:15

@hop: "You don't notice that because you use simple (non-unicode) strings for x and y until you try to decode them into unicode strings." -- so you're saying that the simple non-unicode string "\xd0\xa4" has been magically transmogrified into the unicode character U+0424 CYRILLIC CAPITAL LETTER EF without any decoding happening??

John Machin 2010-07-10 22:44:28

@John: no, i don't say that at all. there is nothing magic about the terminal decoding a valid utf-8 sequence into a unicode character to display. it's just not python that does any decoding.

hop 2010-07-10 23:17:25

@John: The terminal decodes that "\xd0\xa4" to the U+0424 because your terminal is configured for UTF-8, which is typically the default nowadays. If it was set to something else, this would not work.

Thanatos 2010-07-10 23:45:30

@hop: The essence of the problem is that *in this case* "the terminal" decodes the byte string in a fashion inconsistent with `unicode(y, 'utf8')`.

John Machin 2010-07-10 23:58:12

@Thanatos: I'm well aware that utf8 is typically the default (for *x terminals). My point was that @hop's original text appeared to be saying that the terminal wasn't doing any decoding at all.

John Machin 2010-07-11 00:01:52

@John: no that is not the essence of the problem _at all_.

hop 2010-07-11 00:34:28

@hop: I say: terminal is implicitly using 'replace', OP's code is implicitly using 'strict', OP's expectation is based on terminal's behaviour. So what is the essence of the problem according to you?

John Machin 2010-07-11 01:38:20

@John: wrong expectations based on ignorance regarding encodings. i also strongly suspect that the OP originally had a very different problem, since django rarely involves terminals.

hop 2010-07-11 10:59:00

Answer 3

+3 A:

Looks like you have a typo; should be x = '\xd0\xa4'. It helps very much if you copy and paste what you actually ran and what appeared in the output.

"\x92" is not a valid UTF-8 string. This explains the exception that you got.

More of a puzzle is why print y produced ?. What are you calling "the Python console"?? It appears to be operating in "replace" mode and substituting "?" ... are you sure that it's a plain "?" and not a white "?" inside a black diamond? Why do you say that "?" is exactly what you expect to see?

UPDATE: You now say """When I look at the database at the row that contains the '\x92' value, I see this character as ’. An apostrophe. I'm viewing the contents of the database using a Unicode UTF-8 encoding."""

That's not an apostrophe. It seems that that piece of data has been encoded using one of the cp125X (aka windows-125X) encodings. Illustrating using cp1252 (the usual suspect):

IDLE 2.6.4      
>>> import unicodedata
>>> uc = '\x92'.decode('cp1252')
>>> print repr(uc)
u'\u2019'
>>> print uc
’
>>> unicodedata.name(uc)
'RIGHT SINGLE QUOTATION MARK'
>>>
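
The same check in Python 3 syntax, as a sketch. It also re-encodes the character to show what the value would look like if it had been stored as UTF-8; that last step is an addition for illustration, not something the OP reported.

```python
import unicodedata

# Sketch (Python 3): interpret the stray byte as cp1252, then re-encode as UTF-8.
raw = b"\x92"
uc = raw.decode("cp1252")
print(ascii(uc), unicodedata.name(uc))

utf8_bytes = uc.encode("utf-8")  # three bytes, none of them 0x92
print(utf8_bytes)
```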

Instead of "viewing the contents of the database using a Unicode UTF-8 encoding" (whatever that means), try writing a small snippet of Python code to extract the offending string and then do print repr(bad_string). Show us the code that you ran, plus the output of the repr(). Also tell us which version of Python, what platform (Windows or unix-based), and what version of what database software. And the part of the CREATE TABLE statement relevant to the column in question.
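
A minimal sketch of that kind of diagnostic, in Python 3 syntax. The helper name try_decodings, the list of candidate encodings, and the sample bytes are all made up for illustration; in practice bad_string would come from your own database query.

```python
def try_decodings(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Return {encoding: decoded text, or None if decoding fails}."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results

bad_string = b"It\x92s"  # hypothetical sample, stands in for the real column value
print(repr(bad_string))  # the raw bytes, unambiguous regardless of terminal
for enc, text in try_decodings(bad_string).items():
    print(enc, ascii(text))
```

A result that decodes without error is not necessarily the right one (latin-1 accepts every byte), so the repr() of the raw value is still the thing to post.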

Also please read this and this.

John Machin 2010-07-10 22:29:19

didn't i tell you this would happen? :)

hop 2010-07-12 07:44:53

@hop: No, you said you suspected that there was a different underlying problem. And that was like saying that you suspected that the sun rises in the east -- an OP rarely asks the question they should have asked.

John Machin 2010-07-12 07:54:58

Answer 4

+1 A:

0x92 (hex) = 10 010010 (binary)

A byte of the form 10xxxxxx is a continuation byte: it can only follow a lead byte, so 10 can never be the "header" of the first byte. The payload bits 010010 would fit in one byte, but then the "header" must be 0 (--> 00010010), and characters may not be represented with more bytes than needed. Either way, "\x92" is not a valid UTF-8 encoded string.
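
The "header" bits can be inspected directly. The function below is a sketch in Python 3 with a made-up name; it only looks at the leading bit pattern and does not check the finer validity rules (for example, some lead byte values are never used in well-formed UTF-8).

```python
def utf8_byte_role(byte):
    """Classify one byte value (0-255) by its leading bits only."""
    if byte < 0x80:
        return "single-byte (ASCII)"       # 0xxxxxxx
    if byte < 0xC0:
        return "continuation"              # 10xxxxxx
    if byte < 0xE0:
        return "lead of 2-byte sequence"   # 110xxxxx
    if byte < 0xF0:
        return "lead of 3-byte sequence"   # 1110xxxx
    if byte < 0xF8:
        return "lead of 4-byte sequence"   # 11110xxx
    return "invalid"                       # 11111xxx

for b in (0x92, 0xD0, 0xA4):
    print(format(b, "08b"), utf8_byte_role(b))
```

0x92 comes out as a continuation byte, while 0xD0 followed by 0xA4 is a lead byte plus one continuation byte, which is why the OP's x decodes and y does not.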

I guess your database uses some one-byte-per-character encoding (such as latin-1). If you're coding the database queries yourself, you must ensure that the connection encoding is correct or that strings are decoded correctly. With Django models, everything should work automatically.

AndiDog 2010-07-10 22:36:42

Answer 5

A:

I see now where you're confused. Let's look at this:

x = '\xd0\xa4'
y = '\x92'

If I print x, I get Ф. This is because my terminal is using UTF-8 as its character encoding. Thus, when it gets D0 A4, it attempts to decode it as UTF-8, and gets a "Ф". If I change my terminal to use, say, ISO-8859-1 ("latin1"), and I say print x, my terminal will attempt to decode D0 A4 using ISO-8859-1, and since D0 A4 is also a valid ISO-8859-1 string, it will decode, but this time, to "Ð¤".
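
What the terminal does in each setting can be imitated in Python itself. This is a sketch in Python 3 syntax, where the decoding has to be spelled out explicitly.

```python
# Sketch (Python 3): the same two bytes under two different encodings.
x = b"\xd0\xa4"

as_utf8 = x.decode("utf-8")      # one character: U+0424
as_latin1 = x.decode("latin-1")  # two characters: U+00D0, U+00A4

print(ascii(as_utf8), len(as_utf8))
print(ascii(as_latin1), len(as_latin1))
```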

Now, for print y. This isn't a valid UTF-8 string, so my terminal can't decode it. In my case, it signals the error by printing "�". I'm wondering if you see "�" or "?" - you should probably see the former, but it depends on what your terminal does in the face of bad input.

Your terminal's encoding should match whatever $LANG says, and your program should output data in whatever encoding $LANG specifies. Nowadays, $LANG is typically ???.UTF-8, where the ??? varies. (Mine is en_US.UTF-8)
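
To see what your own environment reports, something like the following can help. It is a Python 3 sketch; codeset_from_lang is a hypothetical helper written for this example, and the printed values depend entirely on your system.

```python
import locale
import os
import sys

def codeset_from_lang(lang):
    """Return the part after the dot in e.g. 'en_US.UTF-8', or None if absent."""
    _, dot, codeset = lang.partition(".")
    # Drop an optional modifier such as '@euro'.
    return codeset.split("@")[0] if dot else None

print(os.environ.get("LANG"))               # may be None if LANG is unset
print(locale.getpreferredencoding(False))   # what the locale settings imply
print(sys.stdout.encoding)                  # what Python uses when printing
```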

Now, when you say unicode(y, 'utf8'), Python attempts to decode this as UTF-8, and appropriately throws an exception.
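
The exception carries the details of what went wrong. Below is a sketch in Python 3 syntax, where the equivalent call is y.decode('utf-8'); the attribute names are those of the standard UnicodeDecodeError.

```python
# Sketch (Python 3): catch the decode error and look at where it occurred.
y = b"\x92"
try:
    y.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc
    # start/end give the offending byte range within exc.object
    print(exc.encoding, exc.start, exc.end, exc.reason)
```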

I'm using Gnome Terminal, and can change my character encoding by going to Terminal → Set Character Encoding

Thanatos 2010-07-10 23:58:16


Python UTF8 string confusion
