I'm trying to understand how Python 2.5 deals with unicode strings. By now I think I have a good grasp of how I'm supposed to handle them in code, but I don't fully understand what's going on behind the scenes, particularly when you type strings at the interpreter's prompt.
So Python before 3.0 has two types for strings, namely str (byte strings) and unicode, which are both derived from basestring. The default type for strings is str.
str objects have no notion of their actual encoding; they are just bytes. Either you've encoded a unicode string yourself and therefore know what encoding the bytes are in, or you've read a stream of bytes whose encoding you (ideally) know beforehand. You can guess the encoding of a byte string whose encoding is unknown to you, but there just isn't a reliable way of figuring it out. Your best bet is to decode early, use unicode everywhere in your code, and encode late.
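To make sure I have that workflow right, here is a minimal sketch of what I mean by "decode early, encode late". The helper names read_text and write_text are made up for illustration, and I'm assuming the incoming bytes are known to be UTF-8.

```python
# Sketch of "decode early, encode late".
# read_text / write_text are hypothetical helper names, not library functions.

def read_text(raw_bytes, encoding):
    """Decode bytes into unicode text at the input boundary."""
    return raw_bytes.decode(encoding)


def write_text(text, encoding):
    """Encode unicode text back into bytes at the output boundary."""
    return text.encode(encoding)


# Simulate bytes arriving from a file or socket known to be UTF-8 (assumption).
incoming = u"La ca\xf1a de Espa\xf1a".encode("utf-8")

text = read_text(incoming, "utf-8")      # decode early
shouted = text.upper()                   # work on unicode only
outgoing = write_text(shouted, "utf-8")  # encode late

print(len(incoming))   # number of bytes
print(len(text))       # number of characters
print(repr(outgoing))
```

The byte string is longer than the unicode string because each ñ takes two bytes in UTF-8.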
That's fine. But are strings typed into the interpreter encoded for you behind your back? Provided that my understanding of strings in Python is correct, what method or setting does Python use to make this decision?
The source of my confusion is the differing results I get when I try the same thing in my system's Python installation and in my editor's embedded Python console.
# Editor (Sublime Text)
>>> s = "La caña de España"
>>> s
'La ca\xc3\xb1a de Espa\xc3\xb1a'
>>> s.decode("utf-8")
u'La ca\xf1a de Espa\xf1a'
>>> sys.getdefaultencoding()
'ascii'
# Windows python interpreter
>>> s= "La caña de España"
>>> s
'La ca\xa4a de Espa\xa4a'
>>> s.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python25\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa4 in position 5: unexpected code byte
>>> sys.getdefaultencoding()
'ascii'
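For what it's worth, the two byte sequences above look consistent with the same text being encoded with two different codecs. The sketch below is only my guess at what is happening: I'm assuming the editor's console hands Python UTF-8 bytes and the Windows console uses an OEM code page such as cp437 or cp850 (both map ñ to the byte 0xa4). It also prints sys.stdin.encoding, which I suspect, but have not confirmed, is related to the setting I'm asking about.

```python
import sys

# \xf1 is the code point for n-with-tilde; escapes avoid depending on the
# encoding of this snippet itself.
text = u"La ca\xf1a de Espa\xf1a"

utf8_bytes = text.encode("utf-8")    # matches the Sublime Text output
cp437_bytes = text.encode("cp437")   # matches the Windows console output
cp850_bytes = text.encode("cp850")   # same byte for this character

print(repr(utf8_bytes))
print(repr(cp437_bytes))

# Decoding with the codec that produced the bytes recovers the text...
roundtrip = cp437_bytes.decode("cp437")

# ...while decoding code-page bytes as UTF-8 fails, as in the traceback.
try:
    cp437_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

print(decoded_ok)

# Encoding the interpreter associates with standard input, if any.
print(getattr(sys.stdin, "encoding", None))
```

If that assumption holds, the value of sys.getdefaultencoding() being 'ascii' in both sessions would be unrelated to the difference, since it is identical in both.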