ansaurus

Question

What is the relationship between 'unicode' and 'encode'?

Answer 1

+6 A:

In Python 2:

case a: (unicode object).encode(somecodec) -> string of bytes
case b: (string of bytes).decode(somecodec) -> unicode object
case c: unicode(string of bytes, somecodec) -> unicode object

Cases b and c are equivalent. In each of the three cases you can omit the codec name; it then defaults to 'ascii', the ASCII codec, which supports only the 128 ASCII characters (anything else raises an exception).

Whenever a 'string of bytes' is required on the left of the arrow, you can pass a unicode object (it is implicitly encoded with the 'ascii' codec).

Whenever a 'unicode object' is required on the left of the arrow, you can pass a string of bytes (it is implicitly decoded with the 'ascii' codec).
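
As a minimal sketch of the three cases above (not part of the original answer): the snippet below is written to run on both Python 2 and Python 3, and `text_type` is a helper name introduced here for illustration. The codec is always passed explicitly, because in Python 3 an omitted codec defaults to UTF-8 rather than ASCII.

```python
# `text_type` is a hypothetical helper name: it is `unicode` on Python 2
# and `str` on Python 3.
text_type = type(u'')

text = u'\xe4\xf6\xfc'

# case a: (unicode object).encode(somecodec) -> string of bytes
encoded = text.encode('utf-8')

# case b: (string of bytes).decode(somecodec) -> unicode object
decoded_b = encoded.decode('utf-8')

# case c: unicode(string of bytes, somecodec) -> unicode object
decoded_c = text_type(encoded, 'utf-8')

# Cases b and c give the same result, and both undo case a.
round_trip_ok = (decoded_b == text and decoded_c == text)
```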

Alex Martelli 2010-01-08 02:34:07

Answer 2

A:

This is covered in the tutorial and the Unicode HOWTO.

The unicode function converts non-unicode strings (ASCII by default, but it accepts other encodings too) into unicode. Your error here is that you're passing a string that is already unicode and asking for it to be converted to unicode...

The encode function on a unicode string converts it back to a non-unicode string in some encoding; again, ASCII is the default.

James Polley 2010-01-08 02:34:21

Answer 3

+5 A:

The encoding error:

print unicode(u'\xe4\xf6\xfc')

The unicode() call does nothing here, since its parameter is already a unicode object. print then tries to output that unicode object, and to do so it has to convert it to a string in the encoding of your terminal. But Python doesn't seem to know which encoding your terminal uses and therefore goes with the "safe" alternative of ASCII.

Since u'\xe4\xf6\xfc' cannot be represented in ASCII, this leads to an encoding error.
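
The failure described above can be reproduced without a terminal by encoding to ASCII by hand. This is only a sketch of roughly what print attempts when it falls back to ASCII, not the actual print implementation; it runs on Python 2 and 3.

```python
text = u'\xe4\xf6\xfc'

try:
    # Roughly the conversion print attempts when it assumes an ASCII terminal.
    text.encode('ascii')
    error = None
except UnicodeEncodeError as exc:
    # The exception records which codec failed and where in the text.
    error = exc

failed_codec = error.encoding   # name of the codec that could not encode the text
failed_at = error.start         # index of the first character that could not be encoded
```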

Unicode, encode and decode:

Generally encode() converts a unicode object to a string with a certain character encoding like UTF-8 or ISO-8859-1. Every unicode code point is converted to a sequence of bytes in that encoding:

>>> u'\xe4\xf6\xfc'.encode('utf-8')
'\xc3\xa4\xc3\xb6\xc3\xbc'

The opposite is decode(): it converts a string in a certain encoding to a unicode object containing the corresponding unicode code points.

>>> '\xc3\xa4\xc3\xb6\xc3\xbc'.decode('utf-8')
u'\xe4\xf6\xfc'
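
To make the point about "a sequence of bytes in that encoding" concrete, here is a small sketch (runs on Python 2 and 3) that encodes the same three characters with the two encodings mentioned above and then decodes with the wrong one. The variable names are chosen for this example only.

```python
text = u'\xe4\xf6\xfc'

as_utf8 = text.encode('utf-8')         # two bytes per character here
as_latin1 = text.encode('iso-8859-1')  # one byte per character

lengths = (len(text), len(as_utf8), len(as_latin1))

# Decoding with a codec other than the one used for encoding does not
# raise here, but silently yields different code points.
mismatched = as_utf8.decode('iso-8859-1')
```

The same three code points need six bytes in UTF-8 and three in ISO-8859-1, which is why the codec used for decode() has to match the one used for encode().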

Printing:

print with a string parameter just prints the raw bytes of that string. Whether that results in the desired output depends on the character encoding of the terminal.

>>> print '\xc3\xa4\xc3\xb6\xc3\xbc'  # utf-8 encoding on utf-8 terminal
äöü
>>> print '\xe4\xf6\xfc'              # same encoded as latin-1
���
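
The garbled second line can be approximated in code. The sketch below (Python 2 and 3) assumes a terminal that interprets incoming bytes as UTF-8 and substitutes a replacement character for anything invalid; real terminals may render invalid bytes differently.

```python
latin1_bytes = u'\xe4\xf6\xfc'.encode('iso-8859-1')

# Interpreted strictly as UTF-8, these bytes are not valid at all.
try:
    latin1_bytes.decode('utf-8')
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# A lenient reader substitutes U+FFFD (the replacement character) instead,
# which is what the terminal output above shows.
shown = latin1_bytes.decode('utf-8', 'replace')
```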

When given a unicode parameter, print first tries to encode the unicode object in the terminal's encoding. This only works if Python guesses the right encoding for the terminal and that encoding can actually represent all the characters of the unicode object. Otherwise the encoding step raises an exception or the output contains wrong characters.

>>> print u'\xe4\xf6\xfc'             # it correctly assumes a utf-8 terminal
äöü
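
If you would rather not depend on that guess, one option is to encode explicitly before writing. The function below is a hypothetical helper written for this example (not a standard library function); it assumes that falling back to ASCII and replacing unencodable characters with '?' is acceptable for your output. It runs on Python 2 and 3.

```python
import sys

def encode_for_terminal(text, encoding=None):
    """Encode `text` for output, never raising on unencodable characters.

    `encoding` defaults to what Python reports for stdout; that attribute
    can be missing or None (for example when output is piped), in which
    case this sketch falls back to ASCII.
    """
    encoding = encoding or getattr(sys.stdout, 'encoding', None) or 'ascii'
    # 'replace' substitutes '?' for characters the codec cannot represent.
    return text.encode(encoding, 'replace')

utf8_output = encode_for_terminal(u'\xe4\xf6\xfc', 'utf-8')
ascii_output = encode_for_terminal(u'\xe4\xf6\xfc', 'ascii')
```

With 'utf-8' the characters survive intact; with 'ascii' each one degrades to a question mark instead of raising an exception.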

sth 2010-01-08 02:37:49
