Hello,

I'm writing a Python 3 program that gets the names of files to process from command-line arguments. I'm confused about the proper way to handle different encodings.

I think I'd rather treat filenames as bytes and not strings, since that avoids the danger of using an incorrect encoding. Indeed, some of my filenames use an incorrect encoding (latin1 while my system locale uses utf-8), but that doesn't prevent tools like ls from working. I'd like my tool to be resilient to that as well.

I have two problems: the command-line arguments are given to me as strings (I use argparse), and I want to report errors to the user as strings.

I've successfully adapted my code to use bytes, and my tool can handle files whose names are invalid in the current default encoding, provided it discovers them by recursing through the filesystem, because I convert the arguments to bytes early and use bytes when calling filesystem functions. When I receive an invalid filename as an argument, however, it is handed to me as a Unicode string containing strange characters like \udce8. I don't know what these are, and trying to encode the string always fails, whether with utf-8 or with the corresponding (wrong) encoding (latin1 here).

The other problem is error reporting. I expect users of my tool to parse my stdout (hence my wanting to preserve filenames there), but when reporting errors on stderr I'd rather encode in utf-8, replacing invalid sequences with appropriate "invalid/question mark" characters.

So,

1) Is there a better, completely different way to do this? (Yes, fixing the filenames is planned, but I'd still like my tool to be robust.)

2) How do I get the command-line arguments in their original binary form (not pre-decoded for me), given that re-encoding the decoded argument fails for invalid sequences? And

3) How do I tell the utf-8 codec to replace invalid, undecodable sequences with some replacement mark rather than dying on me?
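The failure can be reproduced with a minimal sketch; the name "caf\udce8" is a hypothetical stand-in for what sys.argv hands you when an argument contains the latin1 byte 0xe8 under a UTF-8 locale:

```python
# Hypothetical argument: what Python's argv decoding yields for the
# latin1-encoded name b"caf\xe8" under a UTF-8 locale.
arg = "caf\udce8"

# Re-encoding the decoded argument with the strict (default) error
# handler fails, both in utf-8 and in the "right" wrong encoding:
for enc in ("utf-8", "latin1"):
    try:
        arg.encode(enc)
    except UnicodeEncodeError:
        print(f"cannot encode as {enc}")
```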

+2  A: 

When I receive an invalid filename as an argument, however, it is handed to me as a Unicode string containing strange characters like \udce8.

Those are surrogate characters: PEP 383 maps each undecodable byte to a lone surrogate in the range U+DC80 through U+DCFF, whose low 8 bits are the original byte (0xe8 becomes \udce8).

See PEP 383: Non-decodable Bytes in System Character Interfaces.
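A minimal sketch of both fixes, assuming Python 3.1+; the byte 0xe8 is the same hypothetical example as above:

```python
import sys

# What argv decoding yields for the latin1 byte 0xe8 under a UTF-8 locale:
arg = "caf\udce8"

# (2) Recover the original bytes: re-encode with the surrogateescape
# error handler, which turns each lone surrogate back into its byte.
raw = arg.encode(sys.getfilesystemencoding(), "surrogateescape")
assert raw == b"caf\xe8"

# (3) Report on stderr without dying: "replace" substitutes a question
# mark for anything utf-8 cannot encode.
printable = arg.encode("utf-8", "replace").decode("utf-8")
print(printable, file=sys.stderr)
```

On Python 3.2 and later, os.fsencode(arg) is a shorthand for the surrogateescape re-encoding step.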

dan04
Okay, that addresses point 2. So the correct course is just to add 'surrogateescape' when I encode the CLI arguments, and only there? Or should I handle everything as strings using the surrogate facility? What about the other points?
b0fh
At least my biggest concern is addressed, so answer accepted !
b0fh
Python 3.1 (but not 3.0) should handle the surrogateescape automatically. Just treat filenames as strings.
dan04
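To illustrate the comment above: on a POSIX system, filesystem functions called with str arguments round-trip undecodable bytes through surrogates, so string-only code keeps working. A sketch, assuming a UTF-8 filesystem encoding and using a hypothetical temp directory:

```python
import os
import tempfile

d = tempfile.mkdtemp()
# Create a file whose name contains a byte that is invalid UTF-8.
bad = os.path.join(os.fsencode(d), b"caf\xe8")
open(bad, "w").close()

# listdir on a str path returns str names; the bad byte shows up
# as the surrogate U+DCE8 and round-trips back to the same byte.
name = os.listdir(d)[0]
assert name == "caf\udce8"
assert os.fsencode(name) == b"caf\xe8"

os.remove(bad)
os.rmdir(d)
```

os.fsencode is Python 3.2+; on 3.1 the equivalent is name.encode(sys.getfilesystemencoding(), 'surrogateescape').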
A: 

Don't go against the grain: filenames are strings, not bytes.

You shouldn't use bytes where you should use a string. A bytes object is a sequence of integers; a string is a sequence of characters. They are different concepts. What you're doing is like using an integer when you should use a boolean.

(Aside: Python stores all strings in memory as Unicode, so every string is stored the same way. An encoding specifies how Python converts between on-disk bytes and this in-memory representation.)

Your operating system stores filenames as strings under a specific encoding. I'm surprised you say that some filenames use different encodings; as far as I know, the filename encoding is system-wide. Functions like open default to the system filename encoding, for example.

Beau Martínez
I wish my OS treated filenames as strings, but many things hint that this is not the case. If I run ls with different locale settings, it still hands me the exact same sequence of bytes; no transcoding is performed when the locale differs from the filesystem encoding.
b0fh
I'd have used strings everywhere (that's what I did in the first place), but it did not work, and I now think the reason was that one of the libraries I'm using (pyxattr) failed to handle surrogates.
b0fh