It seems to me that the special methods __repr__ and __str__ have an important difference in their base definition.

>>> t2 = u'\u0131\u015f\u0131k'
>>> print t2
ışık
>>> t2
Out[0]: u'\u0131\u015f\u0131k'

t2.decode raises an error since t2 is a unicode string.

>>> enc = 'utf-8'
>>> t2.decode(enc)
------------------------------------------------------------
Traceback (most recent call last):
  File "<ipython console>", line 1, in <module>
  File "C:\java\python\Python25\Lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

__str__ raises an error as if decode() function is being called:

>>> t2.__str__()
------------------------------------------------------------
Traceback (most recent call last):
  File "<ipython console>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

but __repr__ works without problem:

>>> t2.__repr__()
Out[0]: "u'\\u0131\\u015f\\u0131k'"

Why does __str__ produce an error whereas __repr__ works properly?

This small difference seems to be causing a bug in a Django application I am working on.

+5  A: 

Basically, __str__ can only output ASCII byte strings. Since t2 contains code points above the ASCII range, it cannot be represented as a plain byte string, so the implicit ASCII encode fails. __repr__, on the other hand, tries to output the Python code needed to recreate the object: you'll see that the output of repr(t2) (this syntax is preferred over t2.__repr__()) is exactly what you set t2 equal to up on the first line. The result of repr is roughly the character sequence ['\\', 'u', '0', ...], all of which are ASCII, whereas the output of str would have to be [unichr(0x0131), unichr(0x015f), unichr(0x0131), 'k'], most of which fall outside the ASCII range.

Generally, when dealing with Django applications, you should use __unicode__ for everything and never touch __str__.
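The same distinction is easy to see in Python 3, where str is always unicode: str() no longer triggers an implicit ASCII encode, and the built-in ascii() produces the escaped, ASCII-only form that Python 2's repr() gave (a sketch, not part of the original answer):

```python
# Python 3 sketch of the repr/str distinction discussed above.
# str is always unicode in Python 3, so str() no longer fails;
# ascii() gives the escaped ASCII-only form that Python 2's repr() gave.
t2 = '\u0131\u015f\u0131k'      # the question's string, ışık

print(str(t2))    # ışık - no implicit ASCII encode in Python 3
print(ascii(t2))  # '\u0131\u015f\u0131k' - ASCII-safe, like Py2 repr

# every character of the ascii() form really is ASCII
assert all(ord(c) < 128 for c in ascii(t2))
```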

More info in the Django documentation on strings.

Michael Fairley
+4  A: 

In general, calling str.__unicode__() or unicode.__str__() is a very bad idea, because bytes can't be safely converted to Unicode code points and vice versa. The exception is ASCII values, which are generally the same in all single-byte encodings. The problem is that you're using the wrong method for the conversion.

To convert unicode to str, you should use encode():

>>> t1 = u"\u0131\u015f\u0131k"
>>> t1.encode("utf-8")
'\xc4\xb1\xc5\x9f\xc4\xb1k'

To convert str to unicode, use decode():

>>> t2 = '\xc4\xb1\xc5\x9f\xc4\xb1k'
>>> t2.decode("utf-8")
u'\u0131\u015f\u0131k'
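In Python 3 the same round trip looks like this (a sketch of the encode/decode pair above; bytes literals replace Python 2's plain str):

```python
# Python 3 version of the encode/decode pair above:
# str.encode() goes from text to bytes, bytes.decode() goes back.
t1 = '\u0131\u015f\u0131k'     # ışık (str is unicode in Python 3)
raw = t1.encode('utf-8')       # b'\xc4\xb1\xc5\x9f\xc4\xb1k'
back = raw.decode('utf-8')
assert back == t1              # lossless round trip
```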
John Millikin
A: 

To add a bit of support to John's good answer:

To understand the naming of the two methods encode() and decode(), you just have to see that Python considers unicode strings of the form u'...' to be the reference format. You encode when going from the reference format into another format (e.g. UTF-8), and you decode when coming from some other format back to the reference format. The unicode format is always considered the "real thing" :-).
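The "reference format" view can be sketched as: every byte encoding is a detour away from unicode and back (Python 3 syntax, where str plays the role of u'...'; the choice of codecs below is illustrative):

```python
# Unicode text as the reference format: each byte encoding is a
# detour away from it (encode) and back (decode).
text = '\u0131\u015f\u0131k'   # ışık

# iso-8859-9 (Latin-5) is the single-byte Turkish codec, chosen here
# because it covers ı and ş; the round trip holds for any codec that
# can represent the text.
for enc in ('utf-8', 'utf-16', 'iso-8859-9'):
    assert text.encode(enc).decode(enc) == text
```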

ThomasH
A: 

Note that in Python 3, unicode is the default (str is a unicode string, and there is no separate unicode type), and __str__() should always give you unicode.
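A quick Python 3 check of that claim (a sketch): str and bytes are fully separated, so the implicit ASCII conversions that caused the errors in the question can no longer happen:

```python
# Python 3 keeps text and bytes strictly apart: str has no decode()
# and bytes has no encode(), so the implicit ASCII conversions from
# Python 2 (and the UnicodeEncodeErrors above) are impossible.
assert not hasattr('\u0131\u015f\u0131k', 'decode')  # str: no decode()
assert not hasattr(b'\xc4\xb1', 'encode')            # bytes: no encode()

# __str__ on a str just returns the unicode text itself
assert str('\u0131\u015f\u0131k') == '\u0131\u015f\u0131k'
```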

A. L. Flanagan