Possible Duplicate:
Python, Unicode, and the Windows console

I have a folder with a filename "01 - ナナナン塊.txt"

I open Python at the interactive prompt in the same folder as the file and attempt to walk the folder hierarchy:

Python 3.1.2 (r312:79149, Mar 21 2010, 00:41:52) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> for x in os.walk('.'):
...     print(x)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\dev\Python31\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 17-21: character maps to <undefined>

Clearly the encoding I'm using isn't able to deal with Japanese characters. Fine. But Python 3.1 is meant to be Unicode all the way down, as I understand it, so I'm at a loss as to what I'm meant to do with this. Anyone have any ideas?

A: 

For hard-coded strings, you'll need to specify the encoding at the top of source files. For byte strings that come from some other source, such as os.walk, you need to specify the byte string's encoding (see unutbu's answer).

André Caron
There are no byte strings in Windows, only UTF-16 strings.
Philipp
@Philipp: All Windows NT-based kernels know only UTF-16 strings. You can still invoke the ANSI version of any Win32 API function, such as `FindFirstFileA()`, to get a folder listing containing what Python calls bytestrings. I assume this is what Python does, because on my Windows machine `os.walk()` with Python 2.6.5 returns items of class `str`, which are byte strings.
André Caron
I'm using Python 3, which is entirely UTF-8: http://www.python.org/dev/peps/pep-3120/
Tom Whittock
Strings in Python 3 are either UTF-16 or UTF-32, but not UTF-8.
Philipp
@Philipp: Sorry, I was responding to the source file encoding thing; I should have made that clearer.
Tom Whittock
+2  A: 

It seems like all answers so far are from Unix people who assume the Windows console is like a Unix terminal, which it is not.

The problem is that you can't write Unicode output to the Windows console using the normal underlying file I/O functions. The Windows API WriteConsole needs to be used. Python should probably be doing this transparently, but it isn't.
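The failure in the question can be reproduced without a console at all, which may help when experimenting. This is a minimal sketch, assuming the console codepage is cp850 as in the question's traceback. `console_safe` is a hypothetical helper, and its escaped output is lossy, so it is a stopgap for readable diagnostics rather than a fix:

```python
def console_safe(text, encoding="cp850"):
    """Escape characters the given codepage cannot represent (lossy)."""
    # backslashreplace turns unencodable characters into \uXXXX escapes
    # instead of raising UnicodeEncodeError.
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


name = "01 - ナナナン塊.txt"

try:
    # This is what the default stdout does with the console codepage.
    name.encode("cp850")
except UnicodeEncodeError as exc:
    print("cp850 cannot encode:", ascii(exc.object[exc.start:exc.end]))

# The escaped form contains only characters cp850 can represent.
print(console_safe(name))
```

The escaped string is pure ASCII, so it prints on any codepage, but the original characters are not displayed.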

There's a different problem if you redirect the output to a file: Windows text files are historically in the ANSI codepage, not Unicode. You can fairly safely write UTF-8 to text files in Windows these days, but Python doesn't do that by default.
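When the output goes to a file you open yourself rather than to a redirected stdout, passing the encoding explicitly avoids the default codepage entirely. A small sketch; the temporary path and file name are made up for illustration:

```python
import os
import tempfile

text = "English 漢字 Кири́ллица"

# Hypothetical output location, used only for this example.
path = os.path.join(tempfile.mkdtemp(), "out.txt")

# An explicit encoding overrides the platform default (the ANSI codepage
# on Windows), so every character here can be written.
with open(path, "w", encoding="utf-8") as f:
    f.write(text + "\n")

# Read it back with the same encoding to confirm nothing was lost.
with open(path, "r", encoding="utf-8") as f:
    round_tripped = f.read()
```

This does not help with `foo.py > file`, where Python picks the encoding for stdout itself; that case is what the code below addresses.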

I think it should do these things, but here's some code to make it happen. You don't have to worry about the details if you don't want to; just call ConsoleFile.wrap_standard_handles(). You do need PyWin (the pywin32 package) installed to get access to the necessary APIs.

import os, sys, io, win32api, win32console, pywintypes

def change_file_encoding(f, encoding):
    """
    TextIOWrapper is missing a way to change the file encoding, so we have to
    do it by creating a new one.
    """

    errors = f.errors
    line_buffering = f.line_buffering
    # f.newlines is not the same as the newline parameter to TextIOWrapper.
    # newlines = f.newlines

    buf = f.detach()

    # TextIOWrapper defaults newline to \r\n on Windows, even though the underlying
    # file object is already doing that for us.  We need to explicitly say "\n" to
    # make sure we don't output \r\r\n; this is the same as the internal function
    # create_stdio.
    return io.TextIOWrapper(buf, encoding, errors, "\n", line_buffering)


class ConsoleFile:
    class FileNotConsole(Exception): pass

    def __init__(self, handle):
        handle = win32api.GetStdHandle(handle)
        self.screen = win32console.PyConsoleScreenBufferType(handle)
        try:
            self.screen.GetConsoleMode()
        except pywintypes.error as e:
            raise ConsoleFile.FileNotConsole

    def write(self, s):
        self.screen.WriteConsole(s)

    def close(self): pass
    def flush(self): pass
    def isatty(self): return True

    @staticmethod
    def wrap_standard_handles():
        sys.stdout.flush()
        try:
            # There seems to be no binding for _get_osfhandle.
            sys.stdout = ConsoleFile(win32api.STD_OUTPUT_HANDLE)
        except ConsoleFile.FileNotConsole:
            sys.stdout = change_file_encoding(sys.stdout, "utf-8")

        sys.stderr.flush()
        try:
            sys.stderr = ConsoleFile(win32api.STD_ERROR_HANDLE)
        except ConsoleFile.FileNotConsole:
            sys.stderr = change_file_encoding(sys.stderr, "utf-8")

ConsoleFile.wrap_standard_handles()

print("English 漢字 Кири́ллица")

This is a little tricky: if stdout or stderr is the console, we need to output with WriteConsole; but if it's not (e.g. foo.py > file), that's not going to work, and we need to change the file's encoding to UTF-8 instead.

The opposite in either case will not work. You can't output to a regular file with WriteConsole (it's not actually a byte API, but a UTF-16 one; PyWin hides this detail), and you can't write UTF-8 to a Windows console.

Also, it really should be using _get_osfhandle to get the handle to stdout and stderr, rather than assuming they're assigned to the standard handles, but that API doesn't seem to have any PyWin binding.
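For what it's worth, the standard library's Windows-only `msvcrt` module does expose `get_osfhandle`, so the lookup may be possible without a PyWin binding. A rough sketch: `os_handle_for_fd` is a hypothetical helper, and it is an assumption (not verified here) that the integer it returns can be passed on to the win32console calls used above:

```python
import os


def os_handle_for_fd(fd):
    """Return the OS handle behind a C runtime file descriptor.

    Returns None on non-Windows platforms, where msvcrt does not exist.
    """
    if os.name != "nt":
        return None
    import msvcrt  # Windows-only standard library module
    return msvcrt.get_osfhandle(fd)


# File descriptor 1 is the C runtime's stdout.
handle = os_handle_for_fd(1)
```

In a real program `sys.stdout.fileno()` would be the natural argument, which follows stdout even if it has been reassigned away from the standard handle.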

Glenn Maynard
+1 – you seem to be the first to actually understand the problem. I think the problem with `WriteConsoleW` vs. `WriteFile` is known in the Python community, but actually implementing the distinction seems to be difficult or at least unpopular.
Philipp
Python is developed largely by Unix people, and spending time on the odd details of other people's platforms is never appealing, but this really is important. Major parts of Python in Windows (e.g. `print`) should *not* be limited to '95-era (actually, these date back to DOS) ANSI codepages.
Glenn Maynard
Wow. This is what I need to do to display a Unicode string in the standard command window in Windows. If it weren't so sad, it would be funny. Thank you very much for doing all that hard work of implementing the output streams properly.
Tom Whittock
Fortunately Python seems to be less Linux-centric than many other OSS projects: the developers are actively working towards better Windows support and accept that Windows is an important platform and not the devil himself. If somebody submitted a patch to switch console output to `WriteConsoleW` it would have a high chance of being integrated.
Philipp
@Tom: consider yourself lucky that Python can even cope with Unicode filenames. Try this with something like PHP or Ruby and you wouldn't even be able to open the file. It's hugely unfortunate that the MS C runtime (on which Python and other languages are built) insists on using the system default codepage for stdio byte interfaces instead of UTF-8.
bobince
@bobince: It does that for compatibility, which is something--let's be honest--Windows is far better at than Linux, which in general doesn't care about backwards compatibility beyond maybe a year or so at all. (Try building a binary for a Linux system that's five years old.) That said, it'd help a lot if Windows had an API call to change the ACP to UTF-8; one gets the sense that they don't do *that* on purpose, just to make the lives of non-Windows-centric programmers harder...
Glenn Maynard