ansaurus

Question

Why does Python sometimes upgrade a string to unicode and sometimes not?

Answer 1

+8 A:

The Python 2 Library Reference (section on string formatting operations) has the answer:

If format is a Unicode object, or if any of the objects being converted using the %s conversion are Unicode objects, the result will also be a Unicode object.

foo = u'Émilie and Juañ are turncoats.'
bar = "foo is %s" % foo

This works because foo is a unicode object, so the rule above applies and the result is a unicode string.

bar = "foo2 is %s" % foo2

In this case, foo2 is an Exception object, which is not a unicode object, so the interpreter converts it to a plain str with str(). That conversion has to encode the exception's unicode message using your default encoding, which is apparently ascii. The ascii codec cannot represent those characters, so the conversion fails with a UnicodeEncodeError.
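The failing step can be reproduced in isolation. The following is a minimal sketch written for Python 3, where the encoding step has to be spelled out rather than happening implicitly as it does in Python 2; the variable names are illustrative and not part of the original question.

```python
# Python 3 sketch: perform explicitly the ASCII encoding that Python 2
# attempts implicitly when str() is applied to an object holding unicode text.
# Escapes are used so the source file itself stays pure ASCII.
text = "\u00c9milie and Jua\u00f1 are turncoats."

try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    # The ascii codec has no representation for the accented characters.
    ascii_failed = True

# An encoding that covers these characters round-trips without error.
round_trip = text.encode("utf-8").decode("utf-8")
```

The same text encodes cleanly with UTF-8, which suggests the failure comes from the choice of codec rather than from the text itself.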

bar = u"foo2 is %s" % foo2

Here it works again, because the format string is a unicode object. So the interpreter tries to convert foo2 to a unicode object as well, which succeeds.

As to Randall's question: this surprises me too. However, it matches what the standard says (reformatted for readability):

%s converts any Python object using str(). If the object or format provided is a unicode string, the resulting string will also be unicode.

The standard does not say how that unicode object is created, so both of the following are legal:

call __str__, decode the result to a Unicode string, and insert it into the output string
call __unicode__ and insert the result directly into the output string

The mixed behaviour of the Python interpreter is rather hideous indeed. I would consider this to be a bug in the standard.

Edit: Quoting "What's New In Python 3.0", emphasis mine:

Everything you thought you knew about binary data and Unicode has changed.

[...]

As a consequence of this change in philosophy, pretty much all code that uses Unicode, encodings or binary data most likely has to change. The change is for the better, as in the 2.x world there were numerous bugs having to do with mixing encoded and unencoded text.
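To illustrate the change described in the quote, here is a small Python 3 sketch, assuming UTF-8 encoded input; the names are hypothetical. Python 3 refuses to mix bytes and text implicitly, so the decoding that Python 2 sometimes guessed at has to be written out.

```python
# Python 3 sketch: text (str) and binary data (bytes) no longer mix implicitly.
data = "\u00c9milie".encode("utf-8")  # bytes, e.g. as read from a file or socket

try:
    "foo is " + data  # str + bytes is rejected instead of silently promoted
    mixed_ok = True
except TypeError:
    mixed_ok = False

# The conversion must be explicit, with the encoding named by the programmer.
formatted = "foo is %s" % data.decode("utf-8")
```

Because the codec is always named at the point of conversion, there is no hidden default encoding left to cause the kind of surprise discussed in this answer.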

Thomas 2010-05-19 17:15:44

And it is using the default encoding (ascii) to convert the unicode to str.

Kathy Van Stone 2010-05-19 17:16:56

But why doesn't it invoke __str__ on the unicode object in the first example and fail then?

samtregar 2010-05-19 17:17:08

@samtregar, in binary operations between one str and one unicode object, it always tries to promote the str to unicode via the ascii codec -- much like in bin ops between a float and an int it promotes the int to float for you. But a formatting operator with the format being a str and the right-hand side an exception object obviously is NOT a "bin op between a str and a unicode"!-)

Alex Martelli 2010-05-19 17:27:10

Surprising that I got two upvotes for an answer that was wrong. I fixed it. All clear now?

Thomas 2010-05-19 17:33:56

@Thomas Yeah, that helps - it's basically a special magic case. The general case of some object with unicode in it doesn't work but the basic case of a first-class unicode object does. Can you explain Randall's addition to my question? I find that particularly surprising!

samtregar 2010-05-19 17:39:39

See my edits. +1 for a question that turned out way more interesting than it seemed!

Thomas 2010-05-19 18:00:11
