I have a Python extension to the Nautilus file browser (AFAIK this runs exclusively on GNU/Linux/Unix-like environments). I decided to split out an expensive computation and run it as a subprocess, pickling the result and sending it back over a pipe. My question concerns the arguments to the script. Since the computation requires a path argument and a boolean argument, I figured I could do this in one of two ways: send the args in a pickled tuple over a pipe, or give them on the command line. I found that the pickled tuple approach is noticeably slower than just passing arguments, so I went with the command-line argument approach.

However, I'm worried about localisation issues that might arise. At present, in the caller I have:

import subprocess
import sys

# path is a unicode object; recurse is a bool
subprocess.Popen(
    [sys.executable, path_to_script, path.encode("utf-8"), str(recurse)],
    stdin=None,
    stdout=subprocess.PIPE)

In the script:

path = unicode(sys.argv[1], "utf-8")

My concern is that encoding the path argument as UTF-8 is a mistake, but I don't know for sure. I want to avoid an "it works on my machine" syndrome. Will this fail if a user has, say, latin1 as their default character encoding? Or does it not matter?

+3  A: 

It does not matter: as long as your script knows to expect a UTF-8 encoding for the argument, it can decode it properly. UTF-8 is the correct choice because it lets you encode ANY Unicode string, whereas a choice such as Latin-1 can only represent the characters of some languages and fails on the rest!
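As a minimal sketch of that point (not the asker's actual code): the helper names `encode_arg` and `decode_arg` and the sample path below are hypothetical, and the snippet only exercises the byte round trip rather than launching a subprocess. It shows that a UTF-8 encode on the caller side followed by a UTF-8 decode on the script side recovers the original path, while Latin-1 cannot even encode the same path.

```python
# -*- coding: utf-8 -*-

def encode_arg(path):
    """Caller side (hypothetical helper): Unicode path -> bytes for the argument list."""
    return path.encode("utf-8")


def decode_arg(raw):
    """Script side (hypothetical helper): raw argument bytes -> Unicode path."""
    return raw.decode("utf-8")


# Made-up sample path mixing Cyrillic and an accented Latin letter.
path = u"/home/user/\u0444\u0430\u0439\u043b/caf\u00e9.txt"

# Both sides agree on UTF-8, so the path survives the trip unchanged.
round_tripped = decode_arg(encode_arg(path))

# Latin-1 has no representation for the Cyrillic characters.
try:
    path.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
```

What matters here is that both ends name the same encoding explicitly instead of relying on whatever the user's locale happens to be.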

Alex Martelli
+1 args don't know anything about Unicode, they just pass bytes and it's up to you to interpret them.
bobince
Okay, that was kind of the essence of the question: "do the args have to be a user-locale encoded string, or can they just be any arbitrary sequence of bytes?" It looks like the answer is the latter. Thanks :)
detly
@detly, yep, good way of putting it! The process receiving the args is responsible for decoding them (if needed). That has costs, of course, but it also has pluses, such as allowing "arbitrary sequences of bytes" as you say (as long as whoever's sending the sequence and its recipient agree on its meaning, encoding, and so forth ;-).
Alex Martelli
+1  A: 

Use sys.getfilesystemencoding() if the file names should be readable by the user. However, this can cause problems when a name contains characters not supported by the system encoding. To avoid this, you can substitute the missing characters with some character sequence (e.g. by registering your own error handling function with codecs.register_error()).
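A rough sketch of such a handler follows. The handler name `escape_missing` and the `<U+XXXX>` substitution format are assumptions chosen for illustration, and Latin-1 stands in for whatever sys.getfilesystemencoding() would return, so that the result does not depend on the machine running it.

```python
import codecs


def escape_missing(error):
    """Hypothetical handler: replace unencodable characters with <U+XXXX> markers."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    # The slice of the input that the codec could not encode.
    bad = error.object[error.start:error.end]
    replacement = u"".join(u"<U+%04X>" % ord(ch) for ch in bad)
    # Return the substitute text and the position at which encoding resumes.
    return replacement, error.end


codecs.register_error("escape_missing", escape_missing)

# Made-up file name: the accented e fits in Latin-1, the Cyrillic letter does not.
name = u"caf\u00e9 \u0444.txt"
encoded = name.encode("latin-1", "escape_missing")
```

Note that this substitution is one-way: decoding the bytes gives back the marker text, not the original character, unless the receiving side is taught to reverse the same convention.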

Denis Otkidach
At this particular point in the program, the paths should not be visible to the user. It's just as a mechanism for passing the paths to a worker subprocess.
detly