From the Python 2.6 shell:

>>> import sys
>>> print sys.getdefaultencoding()
ascii
>>> print u'\xe9'
é
>>> 

I expected either some gibberish or an error after the print statement, since the "é" character isn't part of ASCII and I haven't specified an encoding. I guess I don't understand what ASCII being the default encoding means.

EDIT

Thanks to Mark Rushakoff and Ignacio. Both answers help to understand what's going on. Mark indicated that by creating a Unicode string, u'\xe9', Python assumed one of the many Unicode encodings (UTF-8, UTF-16, UTF-32, ...). However, the fact that nothing prints for the non-Unicode string '\xe9' doesn't appear related at all to ASCII being the default encoding (more on this later). Ignacio then indicated that Python assumes the shell's output to be encoded in UTF-8, which narrows the answer down:

>>> print '\xe9'

>>> print u'\xe9'
é
>>> print u'\xe9'.encode('latin-1')

>>>

After changing my terminal's encoding settings to latin-1, I get this:

>>> print '\xe9'
é
>>> print u'\xe9'
é
>>> print u'\xe9'.encode('latin-1')
é
>>>

My Conclusions:

  • Python outputs non-unicode strings as raw data, without considering the default encoding. The terminal just happens to display them if its current encoding matches the data.
  • On the other hand, Python outputs Unicode strings after encoding them using the UTF-8 scheme. If your terminal isn't set to decode UTF-8 strings at that moment (e.g. you set it to latin1), you might see some gibberish.
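Both conclusions can be checked without fiddling with terminal settings, by encoding the string explicitly (a small sketch in Python 3 syntax, where u'...' literals are plain str):

```python
# The code point U+00E9 ('é') produces different bytes depending on the
# encoding used -- and bytes are all that print ever hands to the terminal.
s = u'\xe9'

utf8_bytes = s.encode('utf-8')      # what Python sends to a UTF-8 terminal
latin1_bytes = s.encode('latin-1')  # what a latin-1 terminal expects

print(repr(utf8_bytes))    # b'\xc3\xa9' -- two bytes
print(repr(latin1_bytes))  # b'\xe9'     -- one byte, same number as the code point

# A latin-1 terminal handed the UTF-8 bytes decodes them as two characters:
print(utf8_bytes.decode('latin-1'))  # Ã©
```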

Bonus: for those who struggle with these Unicode, UTF-8 and latin-1 questions (like I used to):

Unicode is like a character map where some numbers (code points) are conventionally assigned to characters (we vote that '\xe9' is for 'é'). That map still requires a means to be represented in memory (an encoding of the Unicode code points, a mouthful). Various schemes exist to do this (UTF-7, UTF-8, UTF-16, UTF-32, etc.). The most intuitive scheme would be to simply take the map values and put them in memory, but Unicode has some code points larger than 0xFFFF, so a direct 1-to-1, fixed-width mapping of code points to memory would need several bytes per code point (UTF-32 actually does this, using 4 bytes each). It's just wasteful to store 'B' (0x42) in 4 bytes. UTF-8 is a scheme able to store code points that don't require that much space in fewer bytes. It laces code points with flag bits that indicate their space requirements and their boundaries.
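The "one code point, several byte representations" idea can be demonstrated directly (Python 3 syntax; the -le codec variants are used here only to skip the BOM):

```python
# One code point, several encoded sizes.
ch = u'\xe9'                          # code point U+00E9, the number 233
print(ord(ch))                        # 233 -- the abstract code point

print(len(ch.encode('utf-8')))        # 2 bytes
print(len(ch.encode('utf-16-le')))    # 2 bytes
print(len(ch.encode('utf-32-le')))    # 4 bytes -- fixed width
print(len(u'B'.encode('utf-32-le')))  # also 4 bytes, even for plain ASCII 'B'
```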

UTF-8 encoding of Unicode code points, from 0x00 up to and including 0x7F (127):

0xxx xxxx  (in binary)
  • The leading 0 is a flag that tells the UTF-8 decoder that this code point only needs 1 byte. The resulting encoding for code points in this range yields exactly the same bytes in memory as their ASCII counterparts, which incidentally makes both encoding schemes compatible in that range.
  • the x's are the actual slots where the code point is "stored"

e.g. the Unicode code point for 'B' is 0x42, or 0100 0010 in binary (it happens to be the same in ASCII). After encoding in UTF-8 it becomes:

0xxx xxxx
*100 0010  <--- Unicode code point
0100 0010  <--- utf8 encoded (exactly the same)
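That ASCII compatibility in the 1-byte range is easy to verify (Python 3 syntax):

```python
# For code points below 128, UTF-8 output is byte-for-byte identical to ASCII.
b = u'B'
assert b.encode('utf-8') == b.encode('ascii') == b'\x42'

print(format(0x42, '08b'))  # 01000010 -- the leading 0 marks a 1-byte sequence
```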

UTF-8 encoding of Unicode code points above 127 (non-ASCII; code points up to 2047 fit in two bytes):

110x xxxx 10xx xxxx
  • the leading bits '110' tell the UTF-8 decoder that this is the beginning of a 2-byte character.
  • the leading '10' on the second byte marks it as a continuation byte inside a multi-byte sequence.
  • many non-ASCII text characters (all of latin-1's accented letters, for instance) require only 2 bytes, so this is where they're most likely to be found.

e.g. 'é' Unicode code point is 0xe9 (233). This is only a code point. It still needs to be encoded to be useful in a program.

1110 1001    <-- 0xe9

When UTF-8 encodes this value, it determines that the value is above 127 (and below 2048) and should therefore be encoded in 2 bytes:

110x xxxx 10xx xxxx
***0 0011 **10 1001
1100 0011 1010 1001
C    3    A    9

The '\xe9' Unicode code point, after UTF-8 encoding, becomes '\xc3\xa9' in memory, which is exactly what the terminal receives. If your terminal is set to decode strings using latin-1 (one of the non-Unicode legacy encodings), you'll see Ã©, because it just so happens that '\xc3' decodes in latin-1 as the 'Ã' character and '\xa9' as '©'.
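The bit-lacing above can be reproduced by hand (a sketch; real code should just call .encode('utf-8')):

```python
# Hand-rolling the 2-byte UTF-8 encoding of U+00E9.
cp = 0xE9                               # the code point, 1110 1001 in binary

byte1 = 0b11000000 | (cp >> 6)          # 110x xxxx <- top 5 bits of the code point
byte2 = 0b10000000 | (cp & 0b00111111)  # 10xx xxxx <- low 6 bits of the code point

assert bytes([byte1, byte2]) == u'\xe9'.encode('utf-8') == b'\xc3\xa9'

# A latin-1 terminal misreads those two bytes as two separate characters:
print(bytes([byte1, byte2]).decode('latin-1'))  # Ã©
```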

+1  A: 

The Python REPL tries to pick up what encoding to use from your environment. If it finds something sane then it all Just Works. It's when it can't figure out what's going on that it bugs out.

>>> print sys.stdout.encoding
UTF-8
Ignacio Vazquez-Abrams
just out of curiosity, how would I change sys.stdout.encoding to ascii?
mike
You wouldn't. You'd use `codecs.EncodedFile()` to wrap it.
Ignacio Vazquez-Abrams
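A small sketch of the `codecs.EncodedFile()` wrapping mentioned above, using an in-memory `io.BytesIO` as a stand-in for the stdout byte stream (the encodings chosen here are purely illustrative):

```python
import codecs
import io

raw = io.BytesIO()  # stand-in for the underlying byte stream

# Bytes written to `wrapped` are treated as UTF-8 and transcoded to ASCII
# before reaching `raw`; non-ASCII input raises UnicodeEncodeError.
wrapped = codecs.EncodedFile(raw, data_encoding='utf-8', file_encoding='ascii')

wrapped.write(u'plain text'.encode('utf-8'))
print(raw.getvalue())  # b'plain text'
```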
+4  A: 

You have specified an encoding by entering an explicit Unicode string. Compare the results of not using the u prefix.

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> '\xe9'
'\xe9'
>>> u'\xe9'
u'\xe9'
>>> print u'\xe9'
é
>>> print '\xe9'

>>> 

In the case of '\xe9', Python assumes your default encoding (ASCII), thus printing ... something blank.

Mark Rushakoff
so if I understand correctly, when I print out unicode strings (the code points), python assumes that I want an output encoded in utf-8, instead of just trying to give me what it *could* have been in ascii?
mike
@mike: AFAIK what you said is correct. If it *did* print out the Unicode characters but encoded as ASCII, everything would come out garbled and probably all the beginners would be asking, "How come I can't print out Unicode text?"
Mark Rushakoff
@Mark: Thank you. I'm actually one of those beginners, but coming from the side of people who do have some understanding of unicode, which is why this behavior is throwing me off a bit.
mike
@Mark R., not correct, since '\xe9' isn't in the ascii character set. Non-Unicode strings are printed using sys.stdout.encoding, Unicode strings are encoded to sys.stdout.encoding before printing.
Mark Tolonen
+1  A: 

When Unicode characters are printed to stdout, sys.stdout.encoding is used. A non-Unicode character is assumed to be in sys.stdout.encoding and is just sent to the terminal. On my system:

>>> import unicodedata as ud
>>> import sys
>>> sys.stdout.encoding
'cp437'
>>> ud.name(u'\xe9')
'LATIN SMALL LETTER E WITH ACUTE'
>>> ud.name('\xe9'.decode('cp437'))
'GREEK CAPITAL LETTER THETA'
>>> import unicodedata as ud
>>> ud.name(u'\xe9')
'LATIN SMALL LETTER E WITH ACUTE'
>>> '\xe9'.decode('cp437')
u'\u0398'
>>> ud.name(u'\u0398')
'GREEK CAPITAL LETTER THETA'
>>> print u'\xe9'
é
>>> print '\xe9'
Θ

sys.getdefaultencoding() is only used when Python doesn't have another option.
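The "same byte, different character" effect in that transcript doesn't depend on the terminal at all and can be reproduced anywhere (Python 3 syntax, where the byte string is written b'...'):

```python
import unicodedata as ud

# The byte 0xE9 names different characters under different encodings
# (cp437 is the legacy US DOS code page from the transcript above).
print(ud.name(b'\xe9'.decode('latin-1')))  # LATIN SMALL LETTER E WITH ACUTE
print(ud.name(b'\xe9'.decode('cp437')))    # GREEK CAPITAL LETTER THETA
```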

Mark Tolonen
I'll accept this answer because it's the closest to a complete answer, even though it was written after I understood what was going on.
mike