When printing a formatted string with a fixed field width (e.g., %20s), the padding differs between a UTF-8 string and a plain ASCII string:

>>> str1="Adam Matan"
>>> str2="אדם מתן"
>>> print "X %20s X" % str1
X           Adam Matan X
>>> print "X %20s X" % str2
X        אדם מתן X

Note the difference:

X           Adam Matan X
X        אדם מתן X

Any ideas?

+1  A: 

Try decoding the UTF-8 bytes into a unicode object first:

>>> str1="Adam Matan"
>>> str2=unicode("אדם מתן", "utf8")
>>> print "X %20s X" % str2
X              אדם מתן X
>>> print "X %20s X" % str1
X           Adam Matan X
Michał Kwiatkowski
+1 for the `unicode` function
Adam Matan
+5  A: 

You need to mark the second string as Unicode by prefixing the literal with u:

>>> str1="Adam Matan"
>>> str2=u"אדם מתן"
>>> print "X %20s X" % str1
X           Adam Matan X
>>> print "X %20s X" % str2
X              אדם מתן X

With the u prefix the literal is a unicode object, so %20s pads by counting characters rather than UTF-8 bytes.

tghw
+1 for the nice explanation. You may want to check out this tutorial for a better understanding: http://sebsauvage.net/python/snyppets/#unicode
rubayeet
+1  A: 

In Python 2, unprefixed string literals are of type str, which is a byte string: it stores arbitrary bytes, not characters. UTF-8 encodes some characters with more than one byte, so str2 contains more bytes than characters (13 bytes for 7 characters here), which produces the unexpected but perfectly valid behaviour in string formatting. If you look at the actual byte content of the formatted strings (use repr instead of print), you'll see that in both cases the field is exactly 20 bytes (not characters!) long.
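To make the byte/character mismatch concrete, here is a minimal sketch that counts both. It is written to run on Python 2 and 3 alike (hence the `__future__` imports), and `raw.rjust(20)` is used as a stand-in for what `"%20s" % str2` does with a Python 2 byte string; the variable names are only illustrative.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

text = "אדם מתן"              # text string: 7 characters
raw = text.encode("utf-8")     # the same text as UTF-8 bytes

print(len(text), len(raw))     # character count vs. byte count

# Padding the byte string counts bytes (stand-in for Python 2 "%20s" % str2).
padded_bytes = raw.rjust(20)

# Padding the text string counts characters.
padded_text = "%20s" % text

# Decoded again, the byte-padded field is narrower than 20 characters.
byte_field_chars = len(padded_bytes.decode("utf-8"))
print(len(padded_bytes), byte_field_chars, len(padded_text))
```

The byte-padded field is 20 bytes long but only 14 characters wide once displayed, which is the 6-column shortfall seen in the question.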

As already mentioned, the solution is to use unicode strings. When working with strings in Python, you absolutely need to understand the difference between unicode strings and byte strings.
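One way to apply this is to decode any byte string before it reaches the format operator. The sketch below assumes the input is UTF-8; `format_field` is a hypothetical helper, not something from the standard library.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals


def format_field(value, width=20, encoding="utf-8"):
    """Right-justify value in a field of `width` characters.

    Hypothetical helper: byte strings are decoded first (UTF-8 is an
    assumption), so the padding is counted in characters, not bytes.
    """
    if isinstance(value, bytes):
        value = value.decode(encoding)
    return "X %*s X" % (width, value)


print(format_field("Adam Matan"))
print(format_field("אדם מתן".encode("utf-8")))
```

Both lines come out with the same number of characters, whether the value arrives as text or as UTF-8 bytes.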

lunaryorn
+1 for the profound explanation
Adam Matan