From the Python 2.6 shell:
>>> import sys
>>> print sys.getdefaultencoding()
ascii
>>> print u'\xe9'
é
>>>
I expected either some gibberish or an error after the print statement, since the "é" character isn't part of ASCII and I haven't specified an encoding. I guess I don't understand what ASCII being the default encoding means.
EDIT
Thanks to Mark Rushakoff and Ignacio. Both answers helped me understand what's going on. Mark indicated that by creating a Unicode string, u'\xe9', Python assumed one of the many Unicode encodings (UTF-8, UTF-16, UTF-32...). However, the fact that nothing prints for the non-Unicode string '\xe9' doesn't appear related at all to ASCII being the default encoding (more on this later). Ignacio then indicated that Python assumes the shell's output to be encoded in UTF-8, which narrows the answer down:
>>> print '\xe9'
>>> print u'\xe9'
é
>>> print u'\xe9'.encode('latin-1')
>>>
After changing my terminal's encoding settings to latin-1, I get this:
>>> print '\xe9'
é
>>> print u'\xe9'
é
>>> print u'\xe9'.encode('latin-1')
é
>>>
My Conclusions:
- Python outputs non-unicode strings as raw data, without considering the default encoding. The terminal just happens to display them if its current encoding matches the data.
- On the other hand, Python outputs Unicode strings after encoding them using the UTF-8 scheme. If your terminal isn't set to decode UTF-8 strings at that moment (e.g. you set it to latin1), you might see some gibberish.
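The same distinction can be sketched in Python 3 terms (the question above is about Python 2.6, so this is an analogy, not the original shell session): bytes objects are raw data, while str objects are Unicode and must be encoded before they reach the terminal.

```python
# Python 3 sketch of the two conclusions above.
raw = b'\xe9'    # one raw byte; it means nothing until some decoder interprets it
text = u'\xe9'   # the Unicode code point U+00E9, i.e. 'é'

# Encoding the Unicode string under different schemes yields different bytes:
assert text.encode('utf-8') == b'\xc3\xa9'
assert text.encode('latin-1') == b'\xe9'

# The raw byte only "means" é if the decoder happens to be latin-1:
assert raw.decode('latin-1') == u'\xe9'
```

This is exactly the terminal-settings experiment above, minus the terminal: the bytes stay the same, and only the decoder's choice of encoding changes what you see.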
Bonus: for those who struggle with Unicode, UTF-8 and latin-1 questions (like I used to):
Unicode is like a character map where numbers (code points) are conventionally assigned to characters (by convention, 0xE9 stands for 'é'). That map still requires a means to be represented in memory: an encoding of the Unicode code points. Various schemes exist to do this (UTF-7, UTF-8, UTF-16, UTF-32, etc.). The most intuitive scheme would be to simply put the map values directly in memory, but Unicode has some code points larger than 0xFFFF, so a direct one-to-one mapping of code points to memory would require at least 3 bytes per code point (UTF-32 actually reserves 4). It's just wasteful to store 'B' (0x42) in 3 or 4 bytes. UTF-8 is a scheme able to store code points that don't require that much space in fewer bytes. It laces code points with flag bits to indicate their space requirements and their boundaries.
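The space trade-off between the schemes is easy to demonstrate (a Python 3 sketch; the three sample characters are arbitrary picks from the ASCII, Latin-1, and supplementary-plane ranges):

```python
# The same code points take different amounts of memory under
# different Unicode encoding schemes.
for ch in ('B', '\xe9', '\U0001F600'):     # ASCII, latin-1 range, emoji
    print(hex(ord(ch)),
          len(ch.encode('utf-8')),         # 1, 2, 4 bytes
          len(ch.encode('utf-16-le')),     # 2, 2, 4 bytes
          len(ch.encode('utf-32-le')))     # always 4 bytes
```

UTF-8 spends 1 byte on 'B' where UTF-32 spends 4, but pays 4 bytes for the emoji; UTF-32 is fixed-width and wasteful, UTF-8 is variable-width and compact for mostly-ASCII text.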
UTF-8 encoding of Unicode code points from 0x00 up to and including 0x7F (127):
0xxx xxxx (in binary)
- The leading 0 is a flag that tells the UTF-8 decoder that this code point only needs 1 byte. Code points in this range therefore encode to exactly the same byte values as their ASCII counterparts, which incidentally makes the two schemes compatible in that range.
- the x's are the actual spaces where the code point can be "stored"
e.g. Unicode code point for 'B' is '0x42' or 0100 0010 in binary (it happens to be the same in ASCII). After encoding in UTF-8 it becomes:
0xxx xxxx
*100 0010 <--- Unicode code point
0100 0010 <--- utf8 encoded (exactly the same)
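That single-byte case can be checked directly (Python 3 syntax for illustration):

```python
# For code points below 128, UTF-8 output is byte-for-byte identical to ASCII.
assert ord('B') == 0x42
assert 'B'.encode('utf-8') == 'B'.encode('ascii') == b'B'
assert bin(0x42) == '0b1000010'   # 7 bits, so it fits the 0xxxxxxx payload
```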
UTF-8 encoding of unicode code points above 127 (non-ascii):
110x xxxx 10xx xxxx
- the leading bits '110' tell the UTF-8 decoder that this is the start of a two-byte character.
- the leading '10' on the second byte marks it as a continuation byte belonging to the current character.
- most non-ascii text characters require 2 bytes, so this is where they're most likely to be found.
e.g. 'é' Unicode code point is 0xe9 (233). This is only a code point. It still needs to be encoded to be useful in a program.
1110 1001 <-- 0xe9
When utf-8 encodes this value, it determines that the value is above 127 and therefore should be encoded in 2 bytes:
110x xxxx 10xx xxxx
***0 0011 **10 1001
1100 0011 1010 1001
C 3 A 9
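The bit layout worked through above can be verified with a tiny hand-rolled two-byte encoder (a Python 3 sketch; `utf8_two_byte` is a made-up helper for illustration, not a library function):

```python
# Hand-rolled UTF-8 encoder for code points in the two-byte range 0x80..0x7FF,
# following the 110xxxxx 10xxxxxx layout described above.
def utf8_two_byte(cp):
    assert 0x80 <= cp <= 0x7FF
    byte1 = 0b11000000 | (cp >> 6)           # leading byte: 110 + top 5 bits
    byte2 = 0b10000000 | (cp & 0b00111111)   # continuation: 10 + low 6 bits
    return bytes([byte1, byte2])

# 0xE9 ('é') packs to exactly the C3 A9 worked out above:
assert utf8_two_byte(0xE9) == u'\xe9'.encode('utf-8') == b'\xc3\xa9'
```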
After UTF-8 encoding, the Unicode code point 0xE9 becomes the two bytes '\xC3\xA9' in memory, and that is exactly what the terminal receives. If your terminal is set to decode strings using latin-1 (one of the non-Unicode legacy encodings), you'll see Ã©, because it just so happens that '\xC3' decodes in latin-1 as the Ã character and '\xA9' as ©.
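That mojibake round trip is reproducible without a terminal at all (Python 3 sketch):

```python
# Encode é as UTF-8, then (mis)decode the resulting bytes as latin-1.
utf8_bytes = u'\xe9'.encode('utf-8')
assert utf8_bytes == b'\xc3\xa9'
assert utf8_bytes.decode('latin-1') == u'\xc3\xa9'   # 'Ã©' — the gibberish
assert utf8_bytes.decode('utf-8') == u'\xe9'         # 'é' — decoded correctly
```

Same bytes, two decoders, two very different results: that is the whole terminal-encoding story in miniature.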