ansaurus

Question

Answer 1

+4 A:

That's just UTF-8 data. Use .decode to convert it into unicode.

>>> 'D\xc3\xa9cor'.decode('utf-8')
u'D\xe9cor'

You can perform an additional string-escape decode for the 'D\\xc3\\xa9cor' case.

>>> 'D\xc3\xa9cor'.decode('string-escape').decode('utf-8')
u'D\xe9cor'
>>> 'D\\xc3\\xa9cor'.decode('string-escape').decode('utf-8')
u'D\xe9cor'
>>> u'D\\xc3\\xa9cor'.decode('string-escape').decode('utf-8')
u'D\xe9cor'

To handle the 2nd case (unicode input) as well, detect whether the input is unicode and encode it to a str first.

>>> def conv(s):
...   if isinstance(s, unicode):
...     s = s.encode('iso-8859-1')
...   return s.decode('string-escape').decode('utf-8')
... 
>>> map(conv, [u'D\\xc3\\xa9cor', u'D\xc3\xa9cor', 'D\\xc3\\xa9cor', 'D\xc3\xa9cor'])
[u'D\xe9cor', u'D\xe9cor', u'D\xe9cor', u'D\xe9cor']
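
The same idea can be sketched for Python 3, where `str.decode` and the `string-escape` codec no longer exist. This is only a sketch under assumptions that are not in the original answer: it substitutes `unicode_escape` plus an ISO-8859-1 round trip for `string-escape`, and it assumes the input contains no `\u` or `\N` escapes and no code points above U+00FF.

```python
def conv(s):
    """Python 3 sketch of conv(); accepts str or bytes."""
    if isinstance(s, str):
        # Each code point stands for one raw byte, so ISO-8859-1 maps it back.
        s = s.encode('iso-8859-1')
    # unicode_escape resolves literal backslash escapes and reads any other
    # byte as ISO-8859-1, so encoding again recovers the raw UTF-8 bytes.
    raw = s.decode('unicode_escape').encode('iso-8859-1')
    return raw.decode('utf-8')


inputs = ['D\\xc3\\xa9cor', 'D\xc3\xa9cor', b'D\\xc3\\xa9cor', b'D\xc3\xa9cor']
results = [conv(s) for s in inputs]
```

All four inputs come out as the same text, 'D\xe9cor'. Text that is already correct (a real U+00E9) is rejected by the final UTF-8 decode, so this only suits data known to be mis-decoded.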

KennyTM 2010-06-07 05:58:51

It works for that particular case. However: u'D\\xc3\\xa9cor' --> u'D\\xc3\\xa9cor', u'D\xc3\xa9cor' --> UnicodeEncodeError, and 'D\\xc3\\xa9cor' --> u'D\\xc3\\xa9cor'.

Tyson 2010-06-07 06:06:00

@Tyson: It can't work for all cases. How can you make sure `'D:\\xc3\\xa9\\xc3xa9.png'` is really a UTF-8 encoded string, not a Windows path name?

KennyTM 2010-06-07 06:09:24

I can assume that none of the data I'm receiving are Windows pathnames.

Tyson 2010-06-07 06:17:27

@Tyson: In the comment you say `UnicodeEncodeError`. Notice that it's **En**code, not **De**code. Out of curiosity: Are you printing it out inside a loop (in a console or window)? It's just a wild guess on a Monday morning...

exhuma 2010-06-07 06:40:56

For debugging, yeah, I was tossing it out to `stdout`.

Tyson 2010-06-07 06:42:06
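
The Encode/Decode distinction raised in the comments can be reproduced in isolation. A minimal Python 3 sketch, assuming an ASCII-only output encoding of the kind a Python 2 console often implies; the variable names are illustrative. (In Python 2, calling `.decode` on a unicode object also triggers an implicit ASCII encode, which is another possible source of the same error.)

```python
text = 'D\xe9cor'  # already-correct text containing U+00E9

# Writing text to an ASCII-only stream means encoding it first, and that
# encoding step is what fails, not any decoding of the input.
try:
    text.encode('ascii')
    error_name = None
except UnicodeEncodeError as exc:
    error_name = type(exc).__name__

# For debugging output, an error handler keeps the write from failing.
safe = text.encode('ascii', 'backslashreplace')
```

Here `error_name` ends up as 'UnicodeEncodeError', while `safe` holds an ASCII-only rendering that can be written to any stream.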

Answer 2

+2 A:

Write adapters that know which transformations should be applied to their sources.

>>> 'D\xc3\xa9cor'.decode('utf-8')
u'D\xe9cor'
>>> 'D\\xc3\\xa9cor'.decode('string-escape').decode('utf-8')
u'D\xe9cor'
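
One way to read the adapter suggestion, sketched in Python 3. The source names `raw_feed` and `escaped_feed` and all helper names are hypothetical, and `unicode_escape` with an ISO-8859-1 round trip stands in for `string-escape`, which Python 3 lacks.

```python
def from_utf8_bytes(raw):
    """Adapter for sources that deliver raw UTF-8 bytes."""
    return raw.decode('utf-8')


def from_escaped_text(text):
    """Adapter for sources that deliver ASCII text with literal \\xNN escapes."""
    # Resolve the escapes, then map each resulting code point back to a byte.
    raw = text.encode('ascii').decode('unicode_escape').encode('iso-8859-1')
    return raw.decode('utf-8')


# Hypothetical registry: each source is paired with the one transformation
# known to be right for it, so nothing has to be guessed from the data.
ADAPTERS = {
    'raw_feed': from_utf8_bytes,
    'escaped_feed': from_escaped_text,
}


def to_text(source, value):
    return ADAPTERS[source](value)


a = to_text('raw_feed', b'D\xc3\xa9cor')
b = to_text('escaped_feed', 'D\\xc3\\xa9cor')
```

Because the transformation is chosen by source rather than by inspecting the value, ambiguous inputs never have to be sniffed.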

Ignacio Vazquez-Abrams 2010-06-07 06:17:08

Answer 3

A:

Here's the solution I came to before I saw KennyTM's proper, more concise solution:

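def ensure_unicode(string):
    try:
        # Resolve literal backslash escapes; the second pass covers input
        # that was escaped twice.
        string = string.decode('string-escape').decode('string-escape')
    except UnicodeEncodeError:
        # A unicode object with non-ASCII code points cannot be implicitly
        # encoded to ASCII, so map code points below 256 back to single bytes.
        string = string.encode('raw_unicode_escape')

    return unicode(string, 'utf-8')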

Tyson 2010-06-07 06:35:45

Dealing with wacky encodings in Python
