I'm confused. Consider this code working the way I expect:

>>> foo = u'Émilie and Juañ are turncoats.'
>>> bar = "foo is %s" % foo
>>> bar
u'foo is \xc3\x89milie and Jua\xc3\xb1 are turncoats.'

And this code not at all working the way I expect:

>>> try:
...     raise Exception(foo)
... except Exception as e:
...     foo2 = e
... 
>>> bar = "foo2 is %s" % foo2
------------------------------------------------------------
Traceback (most recent call last):
  File "<ipython console>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

Can someone explain what's going on here? Why does it matter whether the unicode data is in a plain unicode string or stored in an Exception object? And why does this fix it:

>>> bar = u"foo2 is %s" % foo2
>>> bar
u'foo2 is \xc3\x89milie and Jua\xc3\xb1 are turncoats.'

I am quite confused! Thanks for the help!

UPDATE: My coding buddy Randall has added to my confusion in an attempt to help me! Send in the reinforcements to explain how this is supposed to make sense:

>>> class A:
...     def __str__(self): return "string"
...     def __unicode__(self): return "unicode"
... 
>>> "%s %s" % (u'niño', A())
u'ni\xc3\xb1o unicode'
>>> "%s %s" % (A(), u'niño')
u'string ni\xc3\xb1o'

Note that the order of the arguments here determines which method is called!

+8  A: 

The Python Library Reference has the answer, in its description of string formatting operations:

If format is a Unicode object, or if any of the objects being converted using the %s conversion are Unicode objects, the result will also be a Unicode object.

foo = u'Émilie and Juañ are turncoats.'
bar = "foo is %s" % foo

This works, because foo is a unicode object. This causes the above rule to take effect and results in a Unicode string.

bar = "foo2 is %s" % foo2

In this case, foo2 is an Exception object, not a unicode object, so the rule above does not apply. The interpreter calls str() on it, which has to encode the unicode message into a plain str using your default encoding. That encoding is apparently ascii, which cannot represent those characters, so the conversion fails with a UnicodeEncodeError.
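A minimal sketch of the failing step, assuming the default encoding is ascii as in the question. It spells out the encode that Python 2 attempts implicitly when str() is applied to the exception. The variable names are illustrative, and the snippet also runs on Python 3.

```python
# The message from the question, written with escapes so the source stays ASCII.
message = u'\xc9milie and Jua\xf1 are turncoats.'

# Encoding with the ascii codec is what fails in the traceback above.
try:
    message.encode('ascii')
    encode_failed = False
except UnicodeEncodeError as err:
    encode_failed = True
    error_reason = err.reason  # 'ordinal not in range(128)'

# Encoding explicitly with a codec that covers the characters succeeds.
encoded = message.encode('utf-8')
```

Choosing the codec explicitly, rather than relying on the default encoding, avoids the error.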

bar = u"foo2 is %s" % foo2

Here it works again, because the format string is a unicode object. So the interpreter tries to convert foo2 to a unicode object as well, which succeeds.


As to Randall's question: this surprises me too. However, it is consistent with the documentation (reformatted for readability):

%s converts any Python object using str(). If the object or format provided is a unicode string, the resulting string will also be unicode.

How such a unicode object is created is left unclear. So both are legal:

  • call __str__, decode back to a Unicode string, and insert it into the output string
  • call __unicode__ and insert the result directly into the output string
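The two strategies can be sketched as plain functions. This is only an illustration: convert_via_str and convert_via_unicode are hypothetical helpers, not how the interpreter is implemented, and the decode step assumes the ascii codec.

```python
class A(object):
    def __str__(self):
        return "string"

    def __unicode__(self):
        return u"unicode"


def convert_via_str(obj):
    """Strategy 1: call __str__, then decode the result to Unicode."""
    result = obj.__str__()
    # On Python 2, str is a byte string and needs decoding;
    # on Python 3, __str__ already returns text.
    if isinstance(result, bytes):
        result = result.decode("ascii")
    return result


def convert_via_unicode(obj):
    """Strategy 2: call __unicode__ and use its result directly."""
    return obj.__unicode__()


first = convert_via_str(A())       # u'string'
second = convert_via_unicode(A())  # u'unicode'
```

Both helpers return a Unicode string, yet they disagree as soon as __str__ and __unicode__ return different text, which is the discrepancy Randall's example exposes.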

The mixed behaviour of the Python interpreter is rather hideous indeed. I would consider this to be a bug in the standard.

Edit: Quoting the Python 3.0 changelog, emphasis mine:

Everything you thought you knew about binary data and Unicode has changed.

[...]

  • As a consequence of this change in philosophy, pretty much all code that uses Unicode, encodings or binary data most likely has to change. The change is for the better, as in the 2.x world there were numerous bugs having to do with mixing encoded and unencoded text.
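For comparison, a short sketch of the question's example under Python 3, where str is always Unicode text and bytes is a separate type. It only runs as intended on Python 3, and the variable mixing_allowed is illustrative.

```python
# Python 3 only: str is Unicode text, bytes is binary data.
foo = u'\xc9milie and Jua\xf1 are turncoats.'

try:
    raise Exception(foo)
except Exception as e:
    foo2 = e

# Formatting the exception needs no implicit ascii encode here.
bar = "foo2 is %s" % foo2

# Mixing bytes and text is rejected outright instead of being
# silently converted through the ascii codec.
try:
    b"foo2 is " + foo
    mixing_allowed = True
except TypeError:
    mixing_allowed = False
```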
Thomas
And it is using the default encoding (ascii) to convert the unicode to str.
Kathy Van Stone
But, why doesn't it invoke __str__ on the unicode object in the first example and fail then?
samtregar
@samtregar, in binary operations between one str and one unicode object, it always tries to promote the str to unicode via the ascii codec -- much like in bin ops between a float and an int it promotes the int to float for you. But a formatting operator with the format being a str and the right-hand side an exception object obviously is NOT a "bin op between a str and a unicode"!-)
Alex Martelli
Surprising that I got two upvotes for an answer that was wrong. I fixed it. All clear now?
Thomas
@Thomas Yeah, that helps - it's basically a special magic case. The general case of some object with unicode in it doesn't work but the basic case of a first-class unicode object does. Can you explain Randall's addition to my question? I find that particularly surprising!
samtregar
See my edits. +1 for a question that turned out way more interesting than it seemed!
Thomas