ansaurus

Question

Answer 1

+6 A:

>>> s = u'Rafa\xc5\x82'
>>> s.encode('raw_unicode_escape').decode('utf-8')
u'Rafa\u0142'
>>>

Ivan Baldin 2009-07-24 13:11:26

@partisann: Neat! I didn't know about raw_unicode_escape (obviously 8-)

RichieHindle 2009-07-24 13:17:42

Thanks partisann, I didn't know about it either.

Chris Ciesielski 2009-07-27 09:10:47

Answer 2

A:

Yow, that was fun!

>>> original = "Rafa\xc3\x85\xc2\x82"
>>> first_decode = original.decode('utf-8')
>>> as_chars = ''.join([chr(ord(x)) for x in first_decode])
>>> result = as_chars.decode('utf-8')
>>> result
u'Rafa\u0142'

So you do the first decode, getting a Unicode string where each character is actually a UTF-8 byte value. You go via the integer value of each of those characters to get back to a genuine UTF-8 string, which you then decode as normal.
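The transcript above is Python 2. For readers on Python 3, the same three steps might look like the sketch below; it assumes the double-encoded data arrives as a bytes object, and the variable names are only illustrative.

```python
# Python 3 sketch of the steps described above (illustrative names).
original = b"Rafa\xc3\x85\xc2\x82"       # double-encoded UTF-8 bytes

# First decode: each resulting character holds one byte value of the real UTF-8.
first_decode = original.decode("utf-8")  # 'Rafa\xc5\x82'

# Go via the integer value of each character to rebuild the genuine UTF-8 bytes.
as_bytes = bytes(ord(ch) for ch in first_decode)

# Second decode: ordinary UTF-8.
result = as_bytes.decode("utf-8")
print(ascii(result))                     # 'Rafa\u0142'
```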

RichieHindle 2009-07-24 13:15:27

Answer 3

A:

>>> weird = u'Rafa\xc5\x82'
>>> weird.encode('latin1').decode('utf8')
u'Rafa\u0142'
>>>

latin1 is just an abbreviation for Richie's nuts'n'bolts method.

It is very curious that the seriously under-described raw_unicode_escape codec gives the same result as latin1 in this case. Do they always give the same result? If so, why have such a codec? If not, it would be preferable to know for sure exactly how the OP's client did the transformation from 'Rafa\xc5\x82' to u'Rafa\xc5\x82', and then to reverse that process exactly -- otherwise we might come unstuck if different data crops up before the double encoding is fixed.
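Until the client's transformation is known, one cautious option is to apply the latin1 round trip only when it succeeds. The Python 3 sketch below assumes the damage really was a Latin-1 mis-decoding of UTF-8 bytes; `undo_double_utf8` is a hypothetical helper name, not something from the answers here.

```python
def undo_double_utf8(text):
    """Hypothetical helper: reverse a Latin-1 mis-decoding of UTF-8 bytes.

    Assumes the damage was done with latin1. If the text does not fit that
    pattern, it is returned unchanged instead of raising.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has code points above 255 (so it cannot be
        # latin1-mangled), or the recovered bytes are not valid UTF-8.
        return text

print(ascii(undo_double_utf8("Rafa\xc5\x82")))  # 'Rafa\u0142'
print(ascii(undo_double_utf8("Rafa\u0142")))    # 'Rafa\u0142' (already correct, left alone)
```

Text that consists only of ASCII passes through unchanged either way, since ASCII encodes identically in latin1 and UTF-8.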

John Machin 2009-07-24 14:31:52

When your string contains only codepoints 0-255, it's always the same. The difference is in characters above that; raw_unicode_escape will escape them, e.g. \u1234, where latin1 will throw UnicodeEncodeError. (Decoding has the symmetric difference--raw_unicode_escape decodes \u1234 escapes, latin1 does not, but it's only encoding here.) They're equivalent here, but I'd stick with latin1, since this has nothing to do with escaping and latin1 is a more widely understood encoding.
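The difference can be checked directly. This is a small Python 3 sketch with illustrative variable names, comparing the two codecs on code points 0-255 and on one character above that range:

```python
# Code points 0-255: both codecs produce identical bytes.
low = "".join(chr(i) for i in range(256))
same_for_low = low.encode("raw_unicode_escape") == low.encode("latin-1")

# Above 255: raw_unicode_escape writes a \uXXXX escape, latin1 raises.
high = "Rafa\u0142"
escaped = high.encode("raw_unicode_escape")   # b'Rafa\\u0142'
try:
    high.encode("latin-1")
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True

# Decoding: raw_unicode_escape interprets the escape, latin1 keeps it literal.
decoded_raw = b"\\u0142".decode("raw_unicode_escape")   # one character, U+0142
decoded_latin1 = b"\\u0142".decode("latin-1")           # six characters: \u0142

print(same_for_low, escaped, latin1_failed, len(decoded_raw), len(decoded_latin1))
```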

Glenn Maynard 2009-07-24 18:58:20

Thanks Glenn, thinking about backslashes after midnight turned my brain into a pumpkin :-)

John Machin 2009-07-24 22:52:41

Decoding double encoded utf8 in Python
