Hi,

I've got a problem with strings that I get from one of my clients over xmlrpc. He sends me UTF-8 strings that have been encoded twice :( so when I receive them in Python I have a unicode object that needs to be decoded one more time, but obviously Python doesn't allow that. I've notified my client, but I need a quick workaround for now, until he fixes it.

Raw string from the TCP dump:

<string>Rafa\xc3\x85\xc2\x82</string>

This is converted into:

u'Rafa\xc5\x82'

The best workaround we have come up with so far is:

eval(repr(u'Rafa\xc5\x82')[1:]).decode("utf8")

This results in the correct string, which is:

u'Rafa\u0142'

This works, but it is ugly as hell and cannot be used in production code. If anyone knows a more suitable way to fix this problem, please let me know. Thanks, Chris

+6  A: 
>>> s = u'Rafa\xc5\x82'
>>> s.encode('raw_unicode_escape').decode('utf-8')
u'Rafa\u0142'
>>>
Ivan Baldin
@partisann: Neat! I didn't know about raw_unicode_escape (obviously 8-)
RichieHindle
Thanks partisann, I didn't know about it either.
Chris Ciesielski
A: 

Yow, that was fun!

>>> original = "Rafa\xc3\x85\xc2\x82"
>>> first_decode = original.decode('utf-8')
>>> as_chars = ''.join([chr(ord(x)) for x in first_decode])
>>> result = as_chars.decode('utf-8')
>>> result
u'Rafa\u0142'

So you do the first decode, getting a Unicode string where each character is actually a UTF-8 byte value. You go via the integer value of each of those characters to get back to a genuine UTF-8 byte string, which you then decode as normal.
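
For readers on Python 3, the same two-step idea can be sketched as below. This is only an illustration under the assumption that the data arrives as the bytes shown in the TCP dump; the snippets in this thread are Python 2, and the variable names here are made up for the example.

```python
# Python 3 sketch of the "decode, map back to bytes, decode again" approach.
raw = b"Rafa\xc3\x85\xc2\x82"       # bytes as they appear in the TCP dump

# First decode: yields a str whose last two characters are U+00C5 and U+0082,
# i.e. each character is really one byte of the intended UTF-8 sequence.
first_decode = raw.decode("utf-8")

# Turn every character back into the byte with the same numeric value.
# This only works while all code points are in the range 0-255.
as_bytes = bytes(ord(ch) for ch in first_decode)

# Second decode: now the bytes are genuine UTF-8.
result = as_bytes.decode("utf-8")
print(ascii(result))                # 'Rafa\u0142'
```

The generator passed to `bytes()` does by hand what encoding to latin1 does in one call, since latin1 maps code points 0-255 to the identical byte values.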

RichieHindle
A: 
>>> weird = u'Rafa\xc5\x82'
>>> weird.encode('latin1').decode('utf8')
u'Rafa\u0142'
>>>

latin1 is just an abbreviation for Richie's nuts'n'bolts method.

It is very curious that the seriously under-described raw_unicode_escape codec gives the same result as latin1 in this case. Do they always give the same result? If so, why have such a codec? If not, it would be preferable to know for sure exactly how the OP's client did the transformation from 'Rafa\xc5\x82' to u'Rafa\xc5\x82', and then to reverse that process exactly -- otherwise we might come unstuck if different data crops up before the double encoding is fixed.
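
Until the client's transformation is known for certain, one defensive option is to attempt the repair and fall back to the input when it does not apply. The following is a Python 3 sketch, not something taken from this thread: `fix_double_utf8` is a hypothetical helper name, and it assumes the extra layer was produced by treating UTF-8 bytes as latin1.

```python
def fix_double_utf8(text):
    """Hypothetical helper: undo one extra layer of UTF-8 encoding, if present.

    Assumes the damage came from reading UTF-8 bytes as latin1.
    Returns the input unchanged when the repair does not apply.
    """
    try:
        # latin1 maps code points 0-255 straight back to byte values.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters above U+00FF (so it was not built
        # from raw bytes), or the bytes are not valid UTF-8.
        return text


print(ascii(fix_double_utf8("Rafa\xc5\x82")))   # 'Rafa\u0142'
print(ascii(fix_double_utf8("Rafa\u0142")))     # 'Rafa\u0142' (already correct)
```

The fallback is a heuristic: correctly encoded text that happens to consist only of characters in the 0-255 range and also forms valid UTF-8 when viewed as bytes would still be altered, so this is a stopgap rather than a substitute for fixing the sender.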

John Machin
When your string contains only codepoints 0-255, the two always give the same result. The difference is in characters above that: raw_unicode_escape will escape them, e.g. \u1234, whereas latin1 will throw UnicodeEncodeError. (Decoding has the symmetric difference -- raw_unicode_escape decodes \u1234 escapes, latin1 does not -- but only encoding is involved here.) They're equivalent here, but I'd stick with latin1, since this has nothing to do with escaping and latin1 is a more widely understood encoding.
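
The difference described above can be checked with a short Python 3 sketch (the thread itself uses Python 2; the variable names are illustrative only):

```python
# Code points 0-255 only: both codecs produce identical bytes.
narrow = "Rafa\xc5\x82"
same = narrow.encode("latin-1") == narrow.encode("raw_unicode_escape")
print(same)                                   # True

# A code point above 255: raw_unicode_escape writes a \uXXXX escape,
# while latin1 refuses to encode it.
wide = "Rafa\u0142"
escaped = wide.encode("raw_unicode_escape")
print(escaped)                                # b'Rafa\\u0142'

try:
    wide.encode("latin-1")
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True
print(latin1_failed)                          # True

# Decoding differs symmetrically: only raw_unicode_escape interprets the escape.
print(ascii(escaped.decode("raw_unicode_escape")))   # 'Rafa\u0142'
print(ascii(escaped.decode("latin-1")))              # 'Rafa\\u0142'
```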
Glenn Maynard
Thanks Glenn, thinking about backslashes after midnight turned my brain into a pumpkin :-)
John Machin