The following unicode object and byte string can each be defined explicitly:

>>> value_str='Andr\xc3\xa9'
>>> value_uni=u'Andr\xc3\xa9'

If I only have u'Andr\xc3\xa9' assigned to a variable like above, how do I convert it to 'Andr\xc3\xa9' in Python 2.5 or 2.6?

EDIT:

I did the following:

>>> value_uni.encode('latin-1')
'Andr\xc3\xa9'

which fixes my issue. Can someone explain to me what exactly is happening?

A: 

It seems like

str(value_uni)

should work... at least, it did when I tried it.

EDIT: Turns out that this only works because my system's default encoding is, as far as I can tell, ISO-8859-1 (Latin-1). So for a platform-independent version of this, try

value_uni.encode('latin1')
David Zaslavsky
I tried that but I get UnicodeEncodeError: 'ascii' codec can't encode characters in position 4-5: ordinal not in range(128). Which Python version are you using and on which OS?
Thierry Lam
Python 2.6.4 on Linux... although now that I think about it, it's possible my system's default encoding is set differently from yours. I'm not entirely sure what my default encoding is, though.
David Zaslavsky
OK, got it, try the new method.
David Zaslavsky
How do you check what your system default encoding is?
Thierry Lam
@Thierry Lam, `import sys; sys.getdefaultencoding()`
tgray
Not to be pushy, but it would be nice to lose the downvote since I've edited my answer to include the correct solution...
David Zaslavsky
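
To see the default-encoding dependence discussed in the comments above, here is a minimal sketch (Python 2; the 'ascii' default shown is typical but system-dependent):

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> str(u'Andr\xc3\xa9')  # implicitly encodes with the default codec
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode characters in position 4-5: ordinal not in range(128)
>>> u'Andr\xc3\xa9'.encode('latin1')  # explicit encoding, platform-independent
'Andr\xc3\xa9'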
+1  A: 

value_uni.encode('utf8') or whatever encoding you need.

See http://docs.python.org/library/stdtypes.html#str.encode

UncleZeiv
Just to add: the two literals above may look the same, but the unicode literal is made of code points that correspond to symbols, whereas a normal string is meaningless unless you know its encoding.
dhill
I get 'Andr\xc3\x83\xc2\xa9', isn't this different than 'Andr\xc3\xa9'?
Thierry Lam
@Thierry: That's what you get if you screw up and put UTF-8 in a unicode.
Ignacio Vazquez-Abrams
Yes, and this is predictable. I think there is no encoding that will convert Unicode code points in range(128, 256) to the corresponding bytes. Prove me wrong.
dhill
Converting to utf-8 will blow the \xc3 into two bytes! And converting to ascii won't work because \xc3 is not in the ASCII range.
I. J. Kennedy
@dhill: By design, latin1 aka ISO-8859-1 does exactly what you are talking about. The first 256 codepoints of Unicode are deliberately the same as latin1. Do this: `assert all(ord(chr(x).decode('latin1')) == x for x in range(256)); assert all(ord(unichr(x).encode('latin1')) == x for x in range(256))`
John Machin
@John Machin: True, but I meant a Unicode encoding; I just didn't include the adjective. I reasoned that there must be at least one special character to build code points larger than a byte.
dhill
@dhill: What is a "Unicode encoding"? What do you mean by "special character"? What is a "codepoint larger than a byte"?
John Machin
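
To make the double-encoding in this thread concrete: U+00C3 and U+00A9 each need two bytes in UTF-8, while latin-1 maps every codepoint below 256 to a single byte. A minimal sketch:

>>> u'Andr\xc3\xa9'.encode('utf8')    # \xc3 -> '\xc3\x83', \xa9 -> '\xc2\xa9'
'Andr\xc3\x83\xc2\xa9'
>>> u'Andr\xc3\xa9'.encode('latin1')  # one byte per codepoint below 256
'Andr\xc3\xa9'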
A: 

Simplified explanation. The str type is able to hold only characters from the 0-255 range. If you want to store unicode (which can contain characters from a much wider range) in a str, you first have to encode the unicode to a format suitable for str, for example UTF-8.

To do this, call the encode method on your unicode object and pass the desired encoding as an argument, for example this_is_str = value_uni.encode('utf-8').
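
A minimal sketch of that round trip (the variable names here are just illustrative):

>>> text = u'Andr\xe9'            # one code point, U+00E9
>>> data = text.encode('utf-8')   # two bytes in UTF-8
>>> data
'Andr\xc3\xa9'
>>> data.decode('utf-8') == text
True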

You can read a longer and more in-depth (and language-agnostic) article on Unicode handling here: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Another excellent article (this time Python-specific): Unicode HOWTO

Bartosz
+1  A: 

You seem to have gotten your encodings muddled up. It seems likely that what you really want is u'Andr\xe9', which is equivalent to 'André'.

But what you have seems to be a UTF-8 encoding that has been incorrectly decoded. You can fix it by converting the unicode string to an ordinary string. I'm not sure what the best way is, but this seems to work:

>>> ''.join(chr(ord(c)) for c in u'Andr\xc3\xa9')
'Andr\xc3\xa9'

Then decode it correctly:

>>> ''.join(chr(ord(c)) for c in u'Andr\xc3\xa9').decode('utf8')
u'Andr\xe9'    

Now it is in the correct format.

However, instead of doing this, if possible you should try to work out why the data was incorrectly encoded in the first place, and fix that problem at its source.
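
Since latin-1 maps the first 256 Unicode codepoints straight to bytes 0-255, the chr(ord(c)) loop can also be collapsed into a single encode (a sketch of the same repair):

>>> u'Andr\xc3\xa9'.encode('latin-1').decode('utf8')
u'Andr\xe9'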

Mark Byers
+1  A: 

The OP is not converting to ascii or utf-8. That's why the suggested encode methods won't work. Try this:

v = u'Andr\xc3\xa9'
s = ''.join(map(lambda x: chr(ord(x)), v))

The chr(ord(x)) business gets the numeric value of each unicode character (which had better fit in one byte for your application), and the ''.join call is an idiom that converts a list of one-character strings back into an ordinary string. No doubt there is a more elegant way.
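
As a quick check, this produces byte-for-byte the same result as encoding to latin-1, which maps codepoints 0-255 directly to bytes (a sketch):

>>> v = u'Andr\xc3\xa9'
>>> ''.join(map(lambda x: chr(ord(x)), v)) == v.encode('latin-1')
True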

I. J. Kennedy
A: 

If you have u'Andr\xc3\xa9', it was likely originally UTF-8 from whatever source it was obtained. If possible, read the source again, decoding with 'utf-8' instead. Otherwise just reverse the mistake:

>>> print u'Andr\xc3\xa9'.encode('latin-1').decode('utf-8')
André
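
If the data came from a file, for example, re-reading it with the right codec avoids the round trip entirely (a sketch; the filename is hypothetical):

>>> import codecs
>>> f = codecs.open('names.txt', encoding='utf-8')  # hypothetical source file
>>> f.read()
u'Andr\xe9'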
Mark Tolonen
+1  A: 

You asked (in a comment) """That is what's puzzling me. How did it go from its original accented form to what it is now? When you say double encoding with utf8 and latin1, is that a total of 3 encodings (2 utf8 + 1 latin1)? What's the order of the encoding from the original state to the current one?"""

In the answer by Mark Byers, he says """what you have seems to be a UTF-8 encoding that has been incorrectly decoded""". You have accepted his answer. But you are still puzzled? OK, here's the blow-by-blow description:

Note: All strings will be displayed using (implicitly) repr(). unicodedata.name() will be used to verify the contents. That way, variations in console encoding cannot confuse interpretation of the strings.

Initial state: you have a unicode object that you have named u1. It contains e-acute:

>>> u1 = u'\xe9'
>>> import unicodedata as ucd
>>> ucd.name(u1)
'LATIN SMALL LETTER E WITH ACUTE'

You encode u1 as UTF-8 and name the result s:

>>> s = u1.encode('utf8')
>>> s
'\xc3\xa9'

You decode s using latin1 -- INCORRECTLY; s was encoded using utf8, NOT latin1. The result is meaningless rubbish.

>>> u2 = s.decode('latin1')
>>> u2
u'\xc3\xa9'
>>> ucd.name(u2[0]); ucd.name(u2[1])
'LATIN CAPITAL LETTER A WITH TILDE'
'COPYRIGHT SIGN'
>>>

Please understand: unicode_object.encode('x').decode('y') when x != y is normally [see note below] a nonsense; it will raise an exception if you are lucky; if you are unlucky it will silently create gibberish. Also please understand that silently creating gibberish is not a bug -- there is no general way that Python (or any other language) can detect that a nonsense has been committed. This applies particularly when latin1 is involved, because all 256 latin1 codepoints map 1 to 1 with the first 256 Unicode codepoints, so it is impossible to get a UnicodeDecodeError from str_object.decode('latin1').

Of course, abnormally (one hopes that it's abnormal) you may need to reverse out such a nonsense by doing gibberish_unicode_object.encode('y').decode('x') as suggested in various answers to your question.
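
Continuing the session above, reversing the nonsense recovers the original:

>>> u3 = u2.encode('latin1').decode('utf8')
>>> u3
u'\xe9'
>>> ucd.name(u3)
'LATIN SMALL LETTER E WITH ACUTE'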

John Machin