The encoding error:
print unicode(u'\xe4\xf6\xfc')
The unicode() call does nothing here, since its parameter is already a unicode object. print then tries to output that unicode object, and to do so it has to convert it to a string in the encoding of your terminal. But Python doesn't seem to know which encoding your terminal uses and therefore falls back to the "safe" alternative of ASCII. Since u'\xe4\xf6\xfc' cannot be represented in ASCII, this leads to an encoding error.
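The failure can be reproduced without involving a terminal by asking for the ASCII encoding explicitly. This is only a minimal sketch, written so that it should run on both Python 2 and Python 3; the variable names are made up for illustration.

```python
text = u'\xe4\xf6\xfc'  # the unicode object from the example above

try:
    # Roughly what print attempts when it falls back to ASCII
    text.encode('ascii')
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    # The exception records which slice of the text could not be encoded
    bad_start, bad_end = exc.start, exc.end
    reason = exc.reason  # 'ordinal not in range(128)'
```

Here all three characters are outside the ASCII range, so the reported slice covers the whole string.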
Unicode, encode and decode:
Generally, encode() converts a unicode object to a byte string in a certain character encoding such as UTF-8 or ISO-8859-1. Every Unicode code point is converted to a sequence of bytes in that encoding:
>>> u'\xe4\xf6\xfc'.encode('utf-8')
'\xc3\xa4\xc3\xb6\xc3\xbc'
The opposite is decode(), which converts a string in a certain encoding to a unicode object containing the corresponding Unicode code points:
>>> '\xc3\xa4\xc3\xb6\xc3\xbc'.decode('utf-8')
u'\xe4\xf6\xfc'
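As a small illustration of "a sequence of bytes in that encoding", the same three code points produce byte strings of different lengths depending on the encoding chosen. The sketch below assumes nothing beyond the standard codecs and should behave the same on Python 2 and 3; the variable names are illustrative.

```python
text = u'\xe4\xf6\xfc'

utf8_bytes = text.encode('utf-8')         # two bytes per character for these code points
latin1_bytes = text.encode('iso-8859-1')  # one byte per character

utf8_len = len(utf8_bytes)
latin1_len = len(latin1_bytes)

# Decoding with the matching encoding restores the original code points
roundtrip_utf8 = utf8_bytes.decode('utf-8')
roundtrip_latin1 = latin1_bytes.decode('iso-8859-1')
```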
Printing:
print with a string parameter just prints the raw bytes of that string. Whether that results in the desired output depends on the character encoding of the terminal.
>>> print '\xc3\xa4\xc3\xb6\xc3\xbc' # utf-8 encoding on utf-8 terminal
äöü
>>> print '\xe4\xf6\xfc' # same encoded as latin-1
���
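What the terminal does with mismatched bytes can be approximated in code by decoding with the wrong encoding. This is a rough sketch of the two possible outcomes, not a model of any particular terminal; it should run on both Python 2 and 3.

```python
latin1_bytes = b'\xe4\xf6\xfc'  # "äöü" encoded as latin-1

try:
    latin1_bytes.decode('utf-8')
    valid_utf8 = True
except UnicodeDecodeError:
    # These bytes are not well-formed utf-8, so a utf-8 terminal
    # has nothing sensible to display for them
    valid_utf8 = False

# The reverse mistake does not fail; it silently yields wrong characters
utf8_bytes = b'\xc3\xa4\xc3\xb6\xc3\xbc'
mojibake = utf8_bytes.decode('latin-1')  # displays as "Ã¤Ã¶Ã¼"
```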
When given a unicode parameter, print first tries to encode the unicode object in the terminal's encoding. This only works if Python guesses the right encoding for the terminal and that encoding can actually represent all the characters of the unicode object. Otherwise the encoding raises an exception or the output contains wrong characters.
>>> print u'\xe4\xf6\xfc' # it correctly assumes a utf-8 terminal
äöü