I start by creating a string variable with some non-ascii utf-8 encoded data on it:
>>> text = 'á'
>>> text
'\xc3\xa1'
>>> text.decode('utf-8')
u'\xe1'
Using unicode()
on it raises errors...
>>> unicode(text)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0:
ordinal not in range(128)
...but if I know the encoding I can use it as second parameter:
>>> unicode(text, 'utf-8')
u'\xe1'
>>> unicode(text, 'utf-8') == text.decode('utf-8')
True
Now if I have a class that returns this text in the __str__()
method:
>>> class ReturnsEncoded(object):
... def __str__(self):
... return text
...
>>> r = ReturnsEncoded()
>>> str(r)
'\xc3\xa1'
unicode(r)
seems to use str()
on it, since it raises the same error as unicode(text)
above:
>>> unicode(r)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0:
ordinal not in range(128)
Until now everything is as planned!
But as no one would ever expect, unicode(r, 'utf-8')
won't even try:
>>> unicode(r, 'utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: coercing to Unicode: need string or buffer, ReturnsEncoded found
Why? Why this inconsistent behavior? Is it a bug? is it intended? Very awkward.