The decode method of unicode strings really doesn't have any applications at all (unless you have some non-text data in a unicode string for some reason -- see below). It is mainly there for historical reasons, I think. In Python 3 it is completely gone.
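You can check that it's gone under any Python 3.x (the exact message may vary slightly between versions, e.g. newer ones add a "Did you mean: encode?" hint):

>>> 'ö'.decode
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'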
unicode().decode() will perform an implicit encoding of s using the default (ascii) codec. Verify this like so:
>>> s = u'ö'
>>> s.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 0: ordinal not in range(128)
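To make the hidden first step explicit, you can spell it out by hand -- a sketch for Python 2, assuming the default encoding hasn't been changed from ascii; it fails at exactly the same point:

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> s.encode(sys.getdefaultencoding())  # the implicit step that decode() performs first
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 0: ordinal not in range(128)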
For str().encode() it's the other way around -- it attempts an implicit decoding of s with the default encoding:
>>> s = 'ö'
>>> s.decode('utf-8')
u'\xf6'
>>> s.encode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
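For contrast, being explicit on both sides round-trips cleanly (assuming the byte string really is UTF-8, as it will be when typed in a UTF-8 terminal):

>>> s.decode('utf-8').encode('utf-8') == s
True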
Used like this, str().encode() is also superfluous.
But there is another application of the latter method that is useful: there are encodings that have nothing to do with character sets, and thus can be applied to 8-bit strings in a meaningful way:
>>> s.encode('zip')
'x\x9c;\xbc\r\x00\x02>\x01z'
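The other bytes-to-bytes codecs that ship with Python 2 behave the same way, and they round-trip as you'd expect; a few examples:

>>> s.encode('zip').decode('zip') == s
True
>>> 'foo'.encode('hex')
'666f6f'
>>> 'foo'.encode('base64')
'Zm9v\n'
>>> 'foo'.encode('rot13')
'sbb'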
You are right, though: the ambiguous usage of "encoding" for both these applications is... awkward. Again, with separate bytes and str types in Python 3, this is no longer an issue.
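A quick illustration of that split under Python 3: str only has encode, bytes only has decode, and the bytes-to-bytes codecs are reachable solely through the codecs module (via codecs.encode()/codecs.decode(), available from 3.2 on):

>>> 'ö'.encode('utf-8')
b'\xc3\xb6'
>>> b'\xc3\xb6'.decode('utf-8')
'ö'
>>> b'\xc3\xb6'.encode
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'bytes' object has no attribute 'encode'
>>> import codecs
>>> codecs.decode(codecs.encode(b'foo', 'zlib_codec'), 'zlib_codec')
b'foo'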