I have Terminal.app set to accept UTF-8, and in bash I can type Unicode characters and copy and paste them. In the Python shell, however, I can't, and if I try to decode Unicode I get errors:

>>> wtf = u'\xe4\xf6\xfc'.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
>>> wtf = u'\xe4\xf6\xfc'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

Anyone know what I'm doing wrong?

+4  A: 

I think you have encoding and decoding backwards. You encode Unicode into a byte stream, and decode the byte stream into Unicode.

Python 2.6.1 (r261:67515, Dec  6 2008, 16:42:21) 
[GCC 4.0.1 (Apple Computer, Inc. build 5370)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> wtf = u'\xe4\xf6\xfc'
>>> wtf
u'\xe4\xf6\xfc'
>>> print wtf
äöü
>>> wtf.encode('UTF-8')
'\xc3\xa4\xc3\xb6\xc3\xbc'
>>> print '\xc3\xa4\xc3\xb6\xc3\xbc'.decode('utf-8')
äöü
Mike Boers
Um. UTF-8 is an already encoded byte stream, so, while not backwards, you got it sideways at least :) Perhaps you meant Unicode instead of UTF-8. I'll edit your post and let you decide.
ΤΖΩΤΖΙΟΥ
Yes, you are right. Thanks!
Mike Boers
+2  A: 

The Unicode strings section of the introductory tutorial explains it well:

To convert a Unicode string into an 8-bit string using a specific encoding, Unicode objects provide an encode() method that takes one argument, the name of the encoding. Lowercase names for encodings are preferred.

>>> u"äöü".encode('utf-8')
'\xc3\xa4\xc3\xb6\xc3\xbc'
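
On the point about encoding names: codec lookup is case-insensitive and accepts several aliases, so the spellings that appear in this thread ('UTF-8', 'utf-8', 'latin1', 'latin-1') resolve to the same codecs. A small sketch using the standard codecs.lookup() function; the helper name same_codec is made up for illustration, and the code is written to run on Python 2.6+ as well as 3.3+:

```python
import codecs

def same_codec(name_a, name_b):
    """Return True if both names resolve to the same registered codec."""
    return codecs.lookup(name_a).name == codecs.lookup(name_b).name

text = u'\xe4\xf6\xfc'  # the three code points from the question

# Spelling and case of the encoding name do not change the encoded bytes.
utf8_upper = text.encode('UTF-8')
utf8_lower = text.encode('utf-8')

print(same_codec('UTF-8', 'utf-8'))
print(same_codec('latin1', 'latin-1'))
print(utf8_upper == utf8_lower)
```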
dbr
Aren't you decoding characters, then, in your last line?
apphacker
Yep, I've removed my fatigued wrongness; the Unicode strings section explains it better than I can.
dbr
+2  A: 
>>> wtf = '\xe4\xf6\xfc'
>>> wtf
'\xe4\xf6\xfc'
>>> print wtf
���
>>> print wtf.decode("latin-1")
äöü
>>> wtf_unicode = unicode(wtf.decode("latin-1"))
>>> wtf_unicode
u'\xe4\xf6\xfc'
>>> print wtf_unicode
äöü
Renato Besen
+10  A: 

I think there is encode/decode confusion all over the place. You start with a Unicode object:

u'\xe4\xf6\xfc'

This is a Unicode object; the three characters are the Unicode code points for "äöü". If you want to turn them into UTF-8, you have to encode them:

>>> u'\xe4\xf6\xfc'.encode('utf-8')
'\xc3\xa4\xc3\xb6\xc3\xbc'

The resulting six bytes are the UTF-8 representation of "äöü".

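If you call decode(...), you try to interpret the characters as some encoding that still needs to be converted to Unicode. Since it already is Unicode, this doesn't work. Your first call attempts an ASCII to Unicode conversion, the second call a UTF-8 to Unicode conversion. In both cases Python 2 first has to turn the Unicode object back into bytes, using the default ASCII codec, before it can decode anything. Since "äöü" is not ASCII, both attempts fail, which is why the traceback reports a UnicodeEncodeError even though you called decode.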
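
A minimal sketch of that failure, written so it runs on both Python 2.6+ and 3.3+. It performs the ASCII encode explicitly, which is the step Python 2 runs behind the scenes when decode() is called on a Unicode object. The function name implicit_ascii_step is invented for this illustration:

```python
text = u'\xe4\xf6\xfc'  # Unicode text: three non-ASCII code points

def implicit_ascii_step(value):
    """Mimic the hidden ASCII encode that Python 2 performs before
    decoding a Unicode object. Returns the bytes on success, or the
    error message if the text is not pure ASCII."""
    try:
        return value.encode('ascii')
    except UnicodeEncodeError as exc:
        return str(exc)

message = implicit_ascii_step(text)
print(message)

# Pure ASCII text passes through the same step without complaint.
plain = implicit_ascii_step(u'abc')
print(plain == b'abc')
```

The message printed for the non-ASCII text should match the one in the question's traceback, which suggests the error comes from this encode step rather than from the decode itself.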

Further confusion might come from the fact that '\xe4\xf6\xfc' is also the Latin-1/ISO-8859-1 encoding of "äöü". If you write a normal Python string (without the leading "u" that marks it as Unicode), you can convert it to a Unicode object with decode('latin1'):

>>> '\xe4\xf6\xfc'.decode('latin1')
u'\xe4\xf6\xfc'
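
To see the difference between the two byte representations side by side, here is a hedged sketch that should run on Python 2.6+ and 3.3+. The helper try_decode is a name introduced only for this example:

```python
text = u'\xe4\xf6\xfc'  # Unicode text, three code points

latin1_bytes = text.encode('latin-1')  # one byte per character
utf8_bytes = text.encode('utf-8')      # two bytes per character for this text

def try_decode(data, encoding):
    """Decode bytes, returning None instead of raising on invalid input."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

# Decoding with the codec that produced the bytes round-trips cleanly.
assert try_decode(latin1_bytes, 'latin-1') == text
assert try_decode(utf8_bytes, 'utf-8') == text

# These Latin-1 bytes are not valid UTF-8, so the decode fails.
assert try_decode(latin1_bytes, 'utf-8') is None

# Any byte sequence decodes as Latin-1, but here the result is
# six wrong characters instead of the original three.
garbled = try_decode(utf8_bytes, 'latin-1')
print('%d %d %d' % (len(latin1_bytes), len(utf8_bytes), len(garbled)))
```

The last case is the one to watch for: decoding with the wrong codec does not always raise an error, it can silently produce garbage.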
sth
aha. This finally makes sense.
apphacker
Agreed. Sense has been made.
Mike Boers
+1  A: 

This answer in a related question about encoding/decoding might be of help.

ΤΖΩΤΖΙΟΥ