I am on Python 2.6 on Windows.

I use os.walk to read a file tree. Files may have non-7-bit characters (German "ae" for example) in their filenames. These are encoded in Python's internal string representation.

I am processing these filenames with Python library functions, and that fails because of the wrong encoding.

How can I convert these filenames to proper (Unicode?) Python strings?

I have a file "d:\utest\ü.txt". Passing the path as Unicode does not work:

>>> list(os.walk('d:\\utest'))
[('d:\\utest', [], ['\xfc.txt'])]
>>> list(os.walk(u'd:\\utest'))
[(u'd:\\utest', [], [u'\xfc.txt'])]
+10  A: 

If you pass a Unicode string to os.walk(), you'll get Unicode results:

>>> list(os.walk(r'C:\example'))          # Passing a byte string (str)
[('C:\\example', [], ['file.txt'])]
>>> 
>>> list(os.walk(ur'C:\example'))        # Passing a Unicode string
[(u'C:\\example', [], [u'file.txt'])]
RichieHindle
A: 

The documentation doesn't say that os.walk always uses os.listdir, nor does it say how os.walk handles Unicode. However, the os.listdir documentation does say:

Changed in version 2.3: On Windows NT/2k/XP and Unix, if path is a Unicode object, the result will be a list of Unicode objects. Undecodable filenames will still be returned as string objects.

Does simply using a Unicode argument work for you?

for dirpath, dirnames, filenames in os.walk(u"."):
  print dirpath
  for fn in filenames:
    print "   ", fn
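
To check this on your own tree, here is a minimal sketch. The helper name walk_unicode and the "utf-8" fallback are assumptions for illustration, not part of the os module. It decodes a byte-string path with sys.getfilesystemencoding() before walking, and it is written so that it also runs on Python 3, where every str is already Unicode.

```python
import os
import sys

# The Unicode text type: `unicode` on Python 2, `str` on Python 3.
text_type = type(u"")


def walk_unicode(top):
    """Pass a Unicode path to os.walk() and yield its tuples unchanged.

    `walk_unicode` is a hypothetical helper, not part of the os module.
    """
    if not isinstance(top, text_type):
        # Byte-string path: decode it with the file system encoding.
        # The "utf-8" fallback is an assumption for platforms where
        # sys.getfilesystemencoding() returns None.
        top = top.decode(sys.getfilesystemencoding() or "utf-8")
    for dirpath, dirnames, filenames in os.walk(top):
        yield dirpath, dirnames, filenames
```

Calling list(walk_unicode('d:\\utest')) should then behave like the os.walk(u'd:\\utest') call in the question, even though the path was given as a byte string.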
Roger Pate
A: 

No, they are not encoded in Python's internal string representation; there is no such thing. They are encoded in the encoding of the operating system/file system. Passing in Unicode works for os.walk, though.

I don't know how os.walk behaves when filenames can't be decoded, but I assume that you'll get a byte string back, like with os.listdir(). In that case you'll again have problems later. Also, not all of the Python 2.x standard library accepts Unicode parameters properly, so you may need to encode them as byte strings anyway. So the problem may in fact be somewhere else, but you'll notice if that is the case. ;-)
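
If some names do come back as byte strings, one way to notice early is to sort the results by type before processing them. This is only a sketch: split_by_type is a made-up helper name, and it assumes that os.walk on Python 2 behaves like os.listdir for undecodable names, which is the guess made above rather than documented behaviour.

```python
# The Unicode text type: `unicode` on Python 2, `str` on Python 3.
text_type = type(u"")


def split_by_type(names):
    """Return (decoded, undecoded): Unicode names and leftover byte strings.

    `split_by_type` is a hypothetical helper for illustration.
    """
    decoded = [n for n in names if isinstance(n, text_type)]
    undecoded = [n for n in names if not isinstance(n, text_type)]
    return decoded, undecoded


# Example with hand-written names instead of a real directory listing.
decoded, undecoded = split_by_type([u"\xfc.txt", b"\xff.txt"])
```

Anything that ends up in undecoded needs separate handling before it is mixed with the Unicode names.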

If you need more control of the decoding, you can always pass in a byte string, and then just decode the results with filename = filename.decode() as usual.
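
A sketch of that manual decoding, assuming the names arrive as byte strings. decode_filename is a hypothetical helper. By default it uses sys.getfilesystemencoding(), with "utf-8" as an assumed fallback if that returns None, and you can override both the encoding and the error handling when you need more control.

```python
import sys


def decode_filename(raw, encoding=None, errors="strict"):
    """Decode a byte-string filename to Unicode.

    Hypothetical helper: `encoding` defaults to the file system encoding,
    and `errors` is passed straight through to the decode() call.
    """
    if encoding is None:
        # Assumed fallback in case the platform reports no encoding.
        encoding = sys.getfilesystemencoding() or "utf-8"
    return raw.decode(encoding, errors)


# The byte 0xFC is "ü" in Latin-1, but it is not valid UTF-8 on its own.
name = decode_filename(b"\xfc.txt", "latin-1")
lossy = decode_filename(b"\xfc.txt", "utf-8", "replace")
```

With errors="strict" a wrong encoding raises UnicodeDecodeError instead of silently producing a damaged name, which is usually what you want while debugging.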

Lennart Regebro
Not my down vote, but: If you don't know, don't assume.
John Machin
Oh, excuse me for being more detailed than the other answers and bringing up potential problems with the solution.
Lennart Regebro
+1 for making a reasonable assumption but still explicitly stating that it's an assumption. Another +1 (if I could) for adding value to the discussion.
RichieHindle
The release announcement for Python 3.1 (released two days ago as I write this) says "File system APIs that use unicode strings now handle paths with undecodable bytes in them." I don't know whether that will fix this potential problem, or how, but anyone concerned about it should check out Python 3.1.
RichieHindle
Well, in 3.x you always get Unicode back, so just switching to 3.x will likely solve this issue. But of course, there aren't many third-party modules for 3.x yet. Most notably, Setuptools is missing.
Lennart Regebro
A: 

A more direct way might be to try the following: find your file system's encoding, and then decode the filename from that encoding to Unicode. For example,

unicode_name = unicode(filename, "utf-8", errors="ignore")

To go the other way,

unicode_name.encode("utf-8")
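
One caution about errors="ignore": if the encoding you pick is not the one the file system actually uses, undecodable bytes are dropped silently. The sketch below shows this with a hand-written byte string. cp1252 is only an assumed example of a Windows code page; on a real system, sys.getfilesystemencoding() tells you what to use.

```python
# "ü.txt" as bytes in cp1252 (an assumed example code page).
raw = b"\xfc.txt"

# Wrong encoding plus "ignore": the "ü" byte is silently discarded.
wrong = raw.decode("utf-8", "ignore")

# Matching encoding: the name survives, and encoding it again gives
# back the original bytes.
right = raw.decode("cp1252")
back = right.encode("cp1252")
```

Here wrong no longer names the original file, while right and back round-trip cleanly.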
gatoatigrado