I'm using os.walk to create a list of all music files under a folder. Some of these filenames are non-ascii, for example:

01 空即是色.mp3

I'm using the mutagen library to parse metadata for this file, and it professes complete unicode support. The filename is being retrieved as unicode, and can be printed as unicode. However, no matter what I do (including normalising the unicode beforehand, or encoding it as utf-8 beforehand), mutagen attempts to open()

01 \xe7\xa9\xba\xe5\x8d\xb3\xe6\x98\xaf\xe8\x89\xb2.mp3

or

01 \u7a7a\u5373\u662f\u8272.mp3

How can I force it to open() the correct filename (the one it is perfectly capable of printing)?

The full code is here.

Note: I am rather new to Python and programming in general, so any advice you could give on my code would be very much appreciated. Thanks in advance.

EDIT: Okay, this is a rather embarrassing error of mine. The problem was not the character encoding; the directory path was not being prepended to the filename passed to open(). How do I find the full path for a file found via walk()? The files are 2-3 directories deep.

+2  A: 

Note that os.walk(dir) yields filenames without their directory path. If you want to open a file, you must prepend the dirpath it was found in:

import os

for dirpath, dirnames, filenames in os.walk(dir):
    for filename in filenames:
        # filename has no directory part; join it with the directory being visited
        path = os.path.join(dirpath, filename)
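
As a minimal sketch of the same idea, the helper below collects full paths for files at any depth under the start directory. The name `find_music_files` and the `.mp3` filter are assumptions for illustration, not part of the asker's code:

```python
import os


def find_music_files(root, extensions=(".mp3",)):
    """Return full paths of files under root whose names end with one of extensions."""
    paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        for filename in filenames:
            if filename.lower().endswith(extensions):
                # dirpath is the directory currently being visited, so the
                # join is correct however deep the file sits below root.
                paths.append(os.path.join(dirpath, filename))
    return paths
```

Each returned path can be passed straight to open() or to a tag reader, because it already includes every intermediate directory.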
Aaron Digulla
Ouch. That hurts. Does that API date back to something like 1970 or so?
Joey
Not really. Python 2.6 has two string types: one is byte-based and the other is Unicode (16-bit) based. No filesystem in the world supports Unicode, but some can handle UTF-8 encoded names (Linux or Windows, for example). The main difference with Windows is that Windows has an API to which you can pass Unicode strings, and it will do the conversion internally. In Python, you just have to do it in your own code (up to version 3.0). This is mainly to support many OSs.
Aaron Digulla
However, even when encoded as a bytestring, I get IOError: [Errno 2] No such file or directory: '01 \xe7\xa9\xba\xe5\x8d\xb3\xe6\x98\xaf\xe8\x89\xb2.mp3'
Ripdog
Aaron: NTFS uses UTF-16 for file names. Exclusively. Windows APIs also only use UTF-16 for that purpose, so Python does convert there already. OS X uses UTF-8 in NFD, IIRC, so maybe normalization has to be done within Python (unless the Unicode string is already normalized). I also didn't mean the single-byte/Unicode string dichotomy in Python (I think it's a bad idea, but I know about it). It's more that if your language supports Unicode strings, you can expect its APIs to handle them too.
Joey
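
The normalization point can be illustrated with the standard `unicodedata` module. This is a minimal sketch assuming Python 3 string semantics; the sample name `café.mp3` and the helper `same_filename` are made up for illustration:

```python
import unicodedata


def same_filename(a, b):
    """Compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "caf\u00e9.mp3"      # 'é' as a single code point (NFC form)
decomposed = "cafe\u0301.mp3"   # 'e' followed by a combining acute accent (NFD form)

print(composed == decomposed)               # False: the code points differ
print(same_filename(composed, decomposed))  # True: equal once normalized
```

The ideographs in the question's filename have no canonical decomposition, so NFC and NFD give the same string there, which fits the asker's report that normalizing beforehand changed nothing.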
@Ripdog: Okay ... Is that file in the same directory where you started the script? Otherwise, you forgot to prepend the path. Try `path = os.path.join(directory, filename)` in the loop in `startScan()`
Aaron Digulla
Well, that's pretty embarrassing. It seems that you are right, but now I have a new problem. How do I get the full pathname of files from walk() when the files are 3 directories deep from the walk start point?
Ripdog
Read the docs for `walk` carefully: http://docs.python.org/library/os.html#os.walk The first element of each tuple it yields is the directory path for all items in the second and third lists.
Aaron Digulla
-1 it seems wrong. `os.walk` will output unicode if started with unicode. `open` will accept unicode. If it doesn't work, try `io.open` as well.
kaizer.se
@kaizer.se: Is that also true for Python 2.6? IIRC, this only works with Python 3+
Aaron Digulla
I don't know about os.walk, but I know that os.listdir() returns unicode if you feed it unicode, and byte strings if you feed it a byte string. We can assume it's the same for os.walk.
Virgil Dupras
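
That assumption can be checked directly. A minimal sketch, assuming Python 3, where `str` plays the role of Python 2's `unicode` and `bytes` the role of its byte string; `walk_name_types` is a hypothetical helper:

```python
import os


def walk_name_types(root):
    """Return the set of types of every dirpath and filename os.walk yields for root."""
    types = set()
    for dirpath, dirnames, filenames in os.walk(root):
        types.add(type(dirpath))
        types.update(type(name) for name in filenames)
    return types
```

Calling it once with a text path and once with the same path passed through `os.fsencode()` shows whether the names come back in the type of the argument.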
Fixed my answer.
Aaron Digulla