From the Python 2.6 shell:

>>> import sys
>>> print sys.getdefaultencoding()
ascii
>>> print u'\xe9'
é
>>> 

I expected either some gibberish or an error after the print statement, since the "é" character isn't part of ASCII and I haven't specified an encoding. I guess I don't understand what ASCII being the default encoding means.

EDIT

Thanks to Mark Rushakoff and Ignacio. Both answers help to understand what's going on. Mark indicated that by creating a Unicode string, u'\xe9', Python assumed one of the many Unicode encodings (UTF-8, UTF-16, UTF-32, ...). However, the fact that nothing prints for the non-Unicode string '\xe9' doesn't appear related at all to ASCII being the default encoding (more on this later). Ignacio then indicated that Python assumes the shell's output to be encoded in UTF-8, which narrows the answer down:

>>> print '\xe9'

>>> print u'\xe9'
é
>>> print u'\xe9'.encode('latin-1')

>>>

After changing my terminal's encoding settings to latin-1, I get this:

>>> print '\xe9'
é
>>> print u'\xe9'
é
>>> print u'\xe9'.encode('latin-1')
é
>>>

My Conclusions:

  • Python outputs non-unicode strings as raw data, without considering the default encoding. The terminal just happens to display them if its current encoding matches the data.
  • On the other hand, Python outputs Unicode strings after encoding them using the UTF-8 scheme. If your terminal isn't set to decode UTF-8 strings at that moment (e.g. you set it to latin1), you might see some gibberish.
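Both conclusions can be checked without fiddling with terminal settings, by encoding the string explicitly (a small sketch in Python 3 syntax, where u'...' literals are plain str):

```python
# The code point U+00E9 ('é') produces different bytes depending on the
# encoding used -- and bytes are all that print ever hands to the terminal.
s = u'\xe9'

utf8_bytes = s.encode('utf-8')      # what Python sends to a UTF-8 terminal
latin1_bytes = s.encode('latin-1')  # what a latin-1 terminal expects

print(repr(utf8_bytes))    # b'\xc3\xa9' -- two bytes
print(repr(latin1_bytes))  # b'\xe9'     -- one byte, same number as the code point

# A latin-1 terminal handed the UTF-8 bytes decodes them as two characters:
print(utf8_bytes.decode('latin-1'))  # Ã©
```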

Bonus: for those who struggle with these Unicode, UTF-8 and latin-1 questions (like I used to):

Unicode is like a character map where some numbers (code points) are conventionally assigned to characters (we vote that '\xe9' is for 'é'). That map still requires a means to be represented in memory (an encoding of the Unicode code points, a mouthful). Various schemes exist to do this (UTF-7, UTF-8, UTF-16, UTF-32, etc.). The most intuitive scheme would be to simply take the map values and put them in memory, but Unicode has some code points larger than 0xFFFF, so a direct 1-to-1, fixed-width mapping of code points to memory would need several bytes per code point (UTF-32 actually does this, using 4 bytes each). It's just wasteful to store 'B' (0x42) in 4 bytes. UTF-8 is a scheme able to store code points that don't require that much space in fewer bytes. It laces code points with flag bits that indicate their space requirements and their boundaries.
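The "one code point, several byte representations" idea can be demonstrated directly (Python 3 syntax; the -le codec variants are used here only to skip the BOM):

```python
# One code point, several encoded sizes.
ch = u'\xe9'                          # code point U+00E9, the number 233
print(ord(ch))                        # 233 -- the abstract code point

print(len(ch.encode('utf-8')))        # 2 bytes
print(len(ch.encode('utf-16-le')))    # 2 bytes
print(len(ch.encode('utf-32-le')))    # 4 bytes -- fixed width
print(len(u'B'.encode('utf-32-le')))  # also 4 bytes, even for plain ASCII 'B'
```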

UTF-8 encoding of Unicode code points, from 0x00 up to and including 0x7F (127):

0xxx xxxx  (in binary)
  • The leading 0 is a flag that tells the UTF-8 decoder that this code point only needs 1 byte. The resulting encoding for code points in this range yields exactly the same bytes in memory as their ASCII counterparts, which incidentally makes both encoding schemes compatible in that range.
  • the x's are the actual slots where the code point is "stored"

e.g. the Unicode code point for 'B' is 0x42, or 0100 0010 in binary (it happens to be the same in ASCII). After encoding in UTF-8 it becomes:

0xxx xxxx
*100 0010  <--- Unicode code point
0100 0010  <--- utf8 encoded (exactly the same)
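That ASCII compatibility in the 1-byte range is easy to verify (Python 3 syntax):

```python
# For code points below 128, UTF-8 output is byte-for-byte identical to ASCII.
b = u'B'
assert b.encode('utf-8') == b.encode('ascii') == b'\x42'

print(format(0x42, '08b'))  # 01000010 -- the leading 0 marks a 1-byte sequence
```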

UTF-8 encoding of Unicode code points above 127 (non-ASCII; code points up to 2047 fit in two bytes):

110x xxxx 10xx xxxx
  • the leading bits '110' tell the UTF-8 decoder that this is the beginning of a 2-byte character.
  • the leading '10' on the second byte marks it as a continuation byte inside a multi-byte sequence.
  • many non-ASCII text characters (all of latin-1's accented letters, for instance) require only 2 bytes, so this is where they're most likely to be found.

e.g. 'é' Unicode code point is 0xe9 (233). This is only a code point. It still needs to be encoded to be useful in a program.

1110 1001    <-- 0xe9

When UTF-8 encodes this value, it determines that the value is above 127 (and below 2048) and should therefore be encoded in 2 bytes:

110x xxxx 10xx xxxx
***0 0011 **10 1001
1100 0011 1010 1001
C    3    A    9

The '\xe9' Unicode code point, after UTF-8 encoding, becomes '\xc3\xa9' in memory, which is exactly what the terminal receives. If your terminal is set to decode strings using latin-1 (one of the non-Unicode legacy encodings), you'll see Ã©, because it just so happens that '\xc3' decodes in latin-1 as the 'Ã' character and '\xa9' as '©'.
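The bit-lacing above can be reproduced by hand (a sketch; real code should just call .encode('utf-8')):

```python
# Hand-rolling the 2-byte UTF-8 encoding of U+00E9.
cp = 0xE9                               # the code point, 1110 1001 in binary

byte1 = 0b11000000 | (cp >> 6)          # 110x xxxx <- top 5 bits of the code point
byte2 = 0b10000000 | (cp & 0b00111111)  # 10xx xxxx <- low 6 bits of the code point

assert bytes([byte1, byte2]) == u'\xe9'.encode('utf-8') == b'\xc3\xa9'

# A latin-1 terminal misreads those two bytes as two separate characters:
print(bytes([byte1, byte2]).decode('latin-1'))  # Ã©
```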

+1  A: 

The Python REPL tries to pick up what encoding to use from your environment. If it finds something sane then it all Just Works. It's when it can't figure out what's going on that it bugs out.

>>> print sys.stdout.encoding
UTF-8
Ignacio Vazquez-Abrams
just out of curiosity, how would I change sys.stdout.encoding to ascii?
mike
You wouldn't. You'd use `codecs.EncodedFile()` to wrap it.
Ignacio Vazquez-Abrams
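A small sketch of the `codecs.EncodedFile()` wrapping mentioned above, using an in-memory `io.BytesIO` as a stand-in for the stdout byte stream (the encodings chosen here are purely illustrative):

```python
import codecs
import io

raw = io.BytesIO()  # stand-in for the underlying byte stream

# Bytes written to `wrapped` are treated as UTF-8 and transcoded to ASCII
# before reaching `raw`; non-ASCII input raises UnicodeEncodeError.
wrapped = codecs.EncodedFile(raw, data_encoding='utf-8', file_encoding='ascii')

wrapped.write(u'plain text'.encode('utf-8'))
print(raw.getvalue())  # b'plain text'
```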
+4  A: 

You have specified an encoding by entering an explicit Unicode string. Compare the results of not using the u prefix.

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> '\xe9'
'\xe9'
>>> u'\xe9'
u'\xe9'
>>> print u'\xe9'
é
>>> print '\xe9'

>>> 

In the case of '\xe9', Python assumes your default encoding (ASCII), thus printing ... something blank.

Mark Rushakoff
so if I understand correctly, when I print out unicode strings (the code points), python assumes that I want an output encoded in utf-8, instead of just trying to give me what it *could* have been in ascii?
mike
@mike: AFAIK what you said is correct. If it *did* print out the Unicode characters but encoded as ASCII, everything would come out garbled and probably all the beginners would be asking, "How come I can't print out Unicode text?"
Mark Rushakoff
@Mark: Thank you. I'm actually one of those beginners, but coming from the side of people who do have some understanding of unicode, which is why this behavior is throwing me off a bit.
mike
@Mark R., not correct, since '\xe9' isn't in the ascii character set. Non-Unicode strings are printed using sys.stdout.encoding, Unicode strings are encoded to sys.stdout.encoding before printing.
Mark Tolonen
+1  A: 

When Unicode characters are printed to stdout, sys.stdout.encoding is used. A non-Unicode character is assumed to be in sys.stdout.encoding and is just sent to the terminal. On my system:

>>> import unicodedata as ud
>>> import sys
>>> sys.stdout.encoding
'cp437'
>>> ud.name(u'\xe9')
'LATIN SMALL LETTER E WITH ACUTE'
>>> ud.name('\xe9'.decode('cp437'))
'GREEK CAPITAL LETTER THETA'
>>> import unicodedata as ud
>>> ud.name(u'\xe9')
'LATIN SMALL LETTER E WITH ACUTE'
>>> '\xe9'.decode('cp437')
u'\u0398'
>>> ud.name(u'\u0398')
'GREEK CAPITAL LETTER THETA'
>>> print u'\xe9'
é
>>> print '\xe9'
Θ

sys.getdefaultencoding() is only used when Python doesn't have another option.
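The "same byte, different character" effect in that transcript doesn't depend on the terminal at all and can be reproduced anywhere (Python 3 syntax, where the byte string is written b'...'):

```python
import unicodedata as ud

# The byte 0xE9 names different characters under different encodings
# (cp437 is the legacy US DOS code page from the transcript above).
print(ud.name(b'\xe9'.decode('latin-1')))  # LATIN SMALL LETTER E WITH ACUTE
print(ud.name(b'\xe9'.decode('cp437')))    # GREEK CAPITAL LETTER THETA
```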

Mark Tolonen
I'll accept this answer because it's the closest to a complete answer, even though it was written after I understood what was going on.
mike