I'd really like to have my Python application deal exclusively with Unicode strings internally. This has been going well for me lately, but I've run into an issue with handling paths. The POSIX API for filesystems isn't Unicode, so it's possible (and actually somewhat common) for files to have "undecodable" names: filenames that aren't encoded in the filesystem's stated encoding.

In Python, this manifests as a mixture of unicode and str objects being returned from os.listdir().

>>> import os
>>> os.listdir(u'/path/to/foo')
[u'bar', 'b\xe1z']

In that example, the character '\xe1' is encoded in Latin-1 or somesuch, even when the (hypothetical) filesystem reports sys.getfilesystemencoding() == 'UTF-8' (in UTF-8, that character would be the two bytes '\xc3\xa1'). For this reason, you'll get UnicodeErrors all over the place if you try to use, for example, os.path.join() with Unicode paths, because the filename can't be decoded.
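For example, joining a unicode directory with one of those byte strings blows up with something like this (a Python 2 session, traceback abbreviated; the exact message may differ):

>>> import os.path
>>> os.path.join(u'/path/to/foo', 'b\xe1z')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe1 in position 1: ordinal not in range(128)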

The Python Unicode HOWTO offers this advice about unicode pathnames:

Note that in most occasions, the Unicode APIs should be used. The bytes APIs should only be used on systems where undecodable file names can be present, i.e. Unix systems.

Because I mainly care about Unix systems, does this mean I should restructure my program to deal only with bytestrings for paths? (If so, how can I maintain Windows compatibility?) Or are there other, better ways of dealing with undecodable filenames? Are they rare enough "in the wild" that I should just ask users to rename their damn files?

(If it is best to just deal with bytestrings internally, I have a followup question: How do I store bytestrings in SQLite for one column while keeping the rest of the data as friendly Unicode strings?)

+2  A: 

If you need to store bytestrings in a DB that is geared for Unicode, then it is probably easier to record the bytestrings hex-encoded. That way the stored value is plain ASCII, so it is safe to keep as a unicode string in the DB.
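A minimal sketch of that idea (the helper names are made up):

import binascii

def name_to_db(raw_name):
    # raw_name is a byte string straight from os.listdir()
    return binascii.hexlify(raw_name).decode('ascii')  # ASCII-only, safe as unicode

def name_from_db(stored):
    # Reverse the transformation when the original bytes are needed again.
    return binascii.unhexlify(stored.encode('ascii'))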

As for the UNIX pathname issue, my understanding is that no particular encoding is enforced for filenames, so it is entirely possible to have Latin-1, KOI8-R, CP1252 and others mixed across different files. This means that each component in a pathname could be in a separate encoding.

I would be tempted to try to guess the encoding of filenames using something like the chardet module. Of course, there are no guarantees, so you still have to handle exceptions, but you would have fewer undecodable names. Some software replaces undecodable characters with '?', which is not reversible. I would rather see them replaced with escapes like \xdd or \udddd, because that can be manually reversed if necessary. In some applications it may even be possible to present the string to a user so that they can key in Unicode characters to replace the undecodable ones.
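A rough sketch of the guessing step (this assumes the third-party chardet package is installed; what to do with a still-undecodable name is left to the caller):

import sys
import chardet

def guess_decode(raw_name):
    # Try the filesystem encoding first, then chardet's best guess.
    for enc in (sys.getfilesystemencoding() or 'utf-8',
                chardet.detect(raw_name).get('encoding')):
        if not enc:
            continue
        try:
            return raw_name.decode(enc)
        except (UnicodeDecodeError, LookupError):
            pass
    return None  # caller decides how to handle truly undecodable names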

If you do go down this route, you may end up extending chardet to handle this job. It would be nice to supplement it with a utility that scans a filesystem for undecodable names and produces a list that can be edited and then fed back in to fix all the names with Unicode equivalents.
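Such a scanner could start out as simple as this (a Python 2 sketch; it only reports names that fail to decode with the filesystem encoding):

import os
import sys

def find_undecodable(root):
    enc = sys.getfilesystemencoding() or 'utf-8'
    # Walk with a byte-string root so every name comes back as a byte string.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            try:
                name.decode(enc)
            except UnicodeDecodeError:
                yield os.path.join(dirpath, name)

for path in find_undecodable('/path/to/foo'):
    print repr(path)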

Michael Dillon
+1 for the first paragraph - the best way to deal with undecodable data is to avoid decoding it if at all possible. Scan the list and encode everything that is a unicode object back to a byte string using the filesystem encoding. Existing undecodable byte strings should remain untouched.
detly
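
In code, the normalization suggested above might look something like this (a Python 2 sketch; the helper name is made up):

import os
import sys

def listdir_bytes(path):
    enc = sys.getfilesystemencoding() or 'utf-8'
    # Re-encode any unicode entries so everything downstream is a byte string;
    # entries that were already undecodable byte strings pass through untouched.
    return [name.encode(enc) if isinstance(name, unicode) else name
            for name in os.listdir(path)]
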
Yes; thanks for the advice. I've taken the plunge and switched over to byte string paths entirely (at least for Python 2.x). For the record, wrapping str objects in buffer objects before storing them in SQLite prevents them from being automatically decoded as UTF-8.
adrian
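
For reference, the buffer trick looks roughly like this (Python 2; the table and column names are made up):

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE files (path BLOB, label TEXT)')

raw_path = 'b\xe1z'  # undecodable byte string, e.g. from os.listdir()
# Wrapping the byte string in buffer() makes sqlite3 store it as a BLOB
# instead of trying to decode it as UTF-8 text.
conn.execute('INSERT INTO files VALUES (?, ?)',
             (buffer(raw_path), u'friendly unicode'))

blob = conn.execute('SELECT path FROM files').fetchone()[0]
assert str(blob) == raw_path  # the original bytes come back intact
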
+3  A: 

Python does have a solution to the problem, if you're willing to switch to Python 3.1 or later:

PEP 383 - Non-decodable Bytes in System Character Interfaces.
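Under PEP 383, os.listdir() with a str argument returns str names in which undecodable bytes are smuggled in as lone surrogate code points (the surrogateescape error handler), and they can be encoded back to the original bytes losslessly. A rough Python 3 sketch (os.fsencode() needs 3.2+):

import os

for name in os.listdir('/path/to/foo'):
    # An undecodable byte like 0xe1 shows up as the surrogate '\udce1',
    # and the str can still be passed to open(), os.stat(), etc.
    raw = os.fsencode(name)  # recover the original bytes
    print(ascii(name), raw)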

Mark Tolonen
Thank you! I didn't know about this PEP. It's a pretty clever solution.
adrian