First I change the Windows CMD encoding to UTF-8 and run the Python interpreter:

    chcp 65001
    python

Then I try to print a unicode string inside it, and when I do, Python crashes in a peculiar way (I just get the cmd prompt back in the same window).

    >>> import sys
    >>> print u'ëèæîð'.encode(sys.stdin.encoding)

Any ideas why it happens and how to make it work?

UPD: sys.stdin.encoding returns 'cp65001'

UPD2: It just came to me that the issue might be connected with the fact that UTF-8 uses a multi-byte character set (kcwu made a good point on that). I tried running the whole example with 'windows-1250' and got 'ëeaî?'. Windows-1250 uses a single-byte character set, so it worked for those characters it understands. However, I still have no idea how to make 'utf-8' work here.
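The multi-byte point can be seen concretely by comparing byte lengths. A small sketch (Python 3 syntax; the string literal matches the one in the question):

```python
# -*- coding: utf-8 -*-
s = u"ëèæîð"

# UTF-8 is multi-byte: each of these accented characters takes 2 bytes.
print(len(s.encode("utf-8")))                     # 10

# windows-1250 is single-byte: one byte per character, and characters
# it cannot represent become '?' when errors="replace" is used.
print(len(s.encode("windows-1250", "replace")))   # 5
```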

UPD3: Oh, I found out it is a known Python bug. I guess what happens is that Python copies the cmd encoding as 'cp65001' to sys.stdin.encoding and tries to apply it to all the input. Since it fails to recognize 'cp65001', it crashes on any input that contains non-ASCII characters.
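For what it's worth, the name lookup can be worked around at runtime by teaching Python that cp65001 is just UTF-8. This is a sketch (Python 3 syntax) using the standard `codecs.register` hook; Python 3.3+ already ships this mapping, so there it is effectively a no-op, and on the old interpreters affected by the bug the console write itself may still fail:

```python
import codecs

def cp65001_search(name):
    # Resolve the name Windows reports for "chcp 65001" to the UTF-8 codec.
    if name.lower() in ("cp65001", "65001"):
        return codecs.lookup("utf-8")
    return None  # let other search functions handle everything else

codecs.register(cp65001_search)

# Encoding with the cmd-reported name now behaves like UTF-8:
print(u"ëèæîð".encode("cp65001") == u"ëèæîð".encode("utf-8"))  # True
```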

A: 

This is because the "code page" of cmd is different from the "mbcs" of the system. Although you changed the code page, Python (actually, Windows) still thinks your "mbcs" hasn't changed.

kcwu
A: 

Do you want Python to encode to UTF-8?

    >>> print u'ëèæîð'.encode('utf-8')
    ëèæîð

Python will not recognize cp65001 as UTF-8.

jcoon
+1  A: 

A few comments: you probably misspelled 'encodig' and '.code'. Here is my run of your example.

    C:\>chcp 65001
    Active code page: 65001

    C:\>\python25\python
    ...
    >>> import sys
    >>> sys.stdin.encoding
    'cp65001'
    >>> s=u'\u0065\u0066'
    >>> s
    u'ef'
    >>> s.encode(sys.stdin.encoding)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    LookupError: unknown encoding: cp65001
    >>>

The conclusion: cp65001 is not a known encoding for Python. Try 'UTF-16' or something similar.
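The traceback above can be generalized: `codecs.lookup` is the machinery that raises the `LookupError`, so you can probe which names a given interpreter accepts. A small sketch (the helper name is mine):

```python
import codecs

def codec_known(name):
    """Return True if this interpreter can resolve the encoding name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False

print(codec_known("utf-8"))    # True on any Python
print(codec_known("utf-16"))   # True on any Python
# Old interpreters returned False here, which is exactly why
# s.encode(sys.stdin.encoding) raised "unknown encoding: cp65001":
print(codec_known("cp65001"))
```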

gimel
Yes, I definitely misspelled it, but I tried it the right way and got the same crash. (This actually proves that the interpreter never got to evaluate the misspelled 'encode()' and 'encoding' attributes and crashed while processing 'ëèæîð'.) I fixed the typo.
Alex
A: 

I had this annoying issue too, and I hated not being able to run my unicode-aware scripts the same way in MS Windows as in Linux. So I managed to come up with a workaround.

Take this script (say, uniconsole.py in your site-packages or whatever):

    import sys, os

    if sys.platform == "win32":

        class UniStream(object):
            __slots__ = "fileno", "softspace"

            def __init__(self, fileobject):
                self.fileno = fileobject.fileno()
                self.softspace = False

            def write(self, text):
                # Bypass the console stream's broken cp65001 encoder:
                # write UTF-8 bytes straight to the file descriptor.
                if isinstance(text, unicode):
                    os.write(self.fileno, text.encode("utf_8"))
                else:
                    os.write(self.fileno, text)

        sys.stdout = UniStream(sys.stdout)
        sys.stderr = UniStream(sys.stderr)

This seems to work around the python bug (or win32 unicode console bug, whatever). Then I added in all related scripts:

    try: import uniconsole
    except ImportError: sys.exc_clear()  # could be just pass, of course
    else: del uniconsole  # reduce pollution, not needed anymore

Finally, I just run my scripts as needed in a console where chcp 65001 is run and the font is Lucida Console. (How I wish that DejaVu Sans Mono could be used instead… but hacking the registry and selecting it as a console font reverts to a bitmap font.)

This is a quick-and-dirty stdout and stderr replacement, and it does not handle any raw_input-related bugs (obviously, since it doesn't touch sys.stdin at all). By the way, I've also added the cp65001 alias for utf_8 in the encodings\aliases.py file of the standard lib.
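The same fd-level trick can be demonstrated without a Windows console. The sketch below is my Python 3 restatement of the wrapper (Python 3's `str` plays the role of Python 2's `unicode`), using a temporary file to stand in for the console's file descriptor:

```python
import os
import tempfile

class UniStream:
    """Write text as UTF-8 bytes directly to the underlying descriptor,
    bypassing the stream's own (possibly broken) encoder."""
    def __init__(self, fileobject):
        self._fd = fileobject.fileno()

    def write(self, text):
        if isinstance(text, str):           # str here == Python 2 unicode
            text = text.encode("utf-8")
        os.write(self._fd, text)

# Demonstrate with a temporary file standing in for the console:
with tempfile.TemporaryFile() as f:
    UniStream(f).write(u"ëèæîð")
    f.seek(0)
    print(f.read() == u"ëèæîð".encode("utf-8"))  # True
```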

ΤΖΩΤΖΙΟΥ