Can someone confirm that Python 2.6 ftplib does NOT support Unicode file names? Or must Unicode file names be specially encoded in order to be used with the ftplib module?

The following email exchange seems to support my conclusion that the ftplib module only supports ASCII file names.

Should ftplib use UTF-8 instead of latin-1 encoding? http://mail.python.org/pipermail/python-dev/2009-January/085408.html

Any recommendations for a third-party Python FTP module that supports Unicode file names? I've googled this question without success [1], [2].

The official Python documentation does not mention Unicode file names [3].

Thank you, Malcolm

[1] ftputil wraps ftplib, so it presumably inherits ftplib's apparent ASCII-only support.

[2] Paramiko's SFTP library does support Unicode file names, however I'm looking specifically for ftp (vs. sftp) support relative to our current project.

[3] http://docs.python.org/library/ftplib.html

WORKAROUND:

The encodings.idna.ToASCII and encodings.idna.ToUnicode functions can convert Unicode path names to and from an ASCII form. If you wrap every remote path name you send in ToASCII, and every name returned by the dir/nlst methods in ToUnicode, you can preserve Unicode path names using the standard ftplib (and also preserve Unicode file names on file systems that don't support Unicode paths). The downside to this technique is that other processes on the server must also use encodings.idna when referencing the files that you upload. BTW: I understand that this is an abuse of the encodings.idna library.
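A minimal sketch of the wrapping described in the workaround, written for Python 3 (where ToASCII returns bytes). The helper names encode_remote_path and decode_remote_path and the sample path are hypothetical; only encodings.idna.ToASCII and encodings.idna.ToUnicode come from the standard library. The sketch assumes "/" separates path components and that each component is converted on its own.

```python
from encodings import idna


def encode_remote_path(path):
    """Hypothetical helper: make each path component ASCII-safe via IDNA."""
    parts = []
    for part in path.split("/"):
        # ToASCII rejects empty labels, so empty components
        # (leading or doubled slashes) are passed through untouched.
        parts.append(idna.ToASCII(part).decode("ascii") if part else part)
    return "/".join(parts)


def decode_remote_path(path):
    """Hypothetical helper: reverse encode_remote_path for names from nlst()/dir()."""
    return "/".join(
        idna.ToUnicode(part) if part else part for part in path.split("/")
    )


original = "/pub/r\u00e9sum\u00e9.txt"   # sample path, not from the thread
encoded = encode_remote_path(original)  # ASCII-only text, safe to hand to ftplib
decoded = decode_remote_path(encoded)   # back to the Unicode name
```

Two caveats follow from IDNA being designed for host names: components containing non-ASCII characters are lowercased during conversion, and each encoded component must be shorter than 64 bytes or ToASCII raises UnicodeError.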

Thank you Peter and Bob for your comments which I found very helpful.

+1  A: 

Personally I would be more worried about what is on the other side of the ftp connection than the support of the library. FTP is a brittle protocol at the best of times without trying to be creative with filenames.

from RFC 959:

     Pathname is defined to be the character string which must be
     input to a file system by a user in order to identify a file.
     Pathname normally contains device and/or directory names, and
     file name specification.  FTP does not yet specify a standard
     pathname convention.  Each user must follow the file naming
     conventions of the file systems involved in the transfer.

To me that means that filenames should conform to the lowest common denominator. Nowadays the number of DOS servers, VAXes and IBM mainframes is negligible, and chances are you'll end up on a Windows or Unix box, so the common denominator is quite high. Still, making assumptions about which code page the remote site will accept appears pretty risky to me.

Peter Tillemans
@Peter: I wish I could mark both responses as answers as I found your response very helpful as well. Thank you for your help! Regards, Malcolm
Malcolm
+2  A: 

ftplib has no knowledge of Unicode whatsoever. It is intended to be passed byte-strings for filenames, and it'll return byte strings when asked for a directory list. Those are the exact strings of bytes passed-to/returned-from the server.

If you pass a Unicode string to ftplib in Python 2.x, it will be coerced to bytes when it is sent to the underlying socket object. This coercion uses Python's ‘default’ encoding, i.e. US-ASCII for safety, and raises an exception for non-ASCII characters.
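To illustrate the difference between the implicit coercion and encoding the name yourself, here is a small sketch in Python 3 syntax. The explicit encode("ascii") call stands in for the coercion Python 2 performs implicitly; the sample file name and the choice of UTF-8 are assumptions, since the right encoding depends on what the server expects.

```python
name = "r\u00e9sum\u00e9.txt"  # sample Unicode file name (assumption)

# Stand-in for Python 2's implicit coercion with the ASCII default encoding:
# a non-ASCII character makes the conversion fail.
try:
    name.encode("ascii")
    coercion_failed = False
except UnicodeEncodeError:
    coercion_failed = True

# Encoding explicitly produces the byte string that would go over the wire.
# UTF-8 is only a guess at what the remote file system wants.
wire_name = name.encode("utf-8")
```

In Python 2 the equivalent would be to call .encode(...) on the unicode object yourself and pass the resulting byte string to ftplib, so that no implicit ASCII conversion takes place.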

The python-dev message to which you linked is talking about ftplib in Python 3.x, where strings are Unicode by default. This leaves modules like ftplib in a tricky situation because although they now use Unicode strings at their front-end, the actual protocol behind it is byte-based. There therefore has to be an extra level of encoding/decoding involved, and without explicit intervention to specify what encoding is in use, there's a fair chance it'll choose wrong.

ftplib in 3.x chose to default to ISO-8859-1 in order to preserve each byte as a character inside the Unicode string. Unfortunately this will give unexpected results in the common case where the target server stores filenames as UTF-8 (whether or not the FTP daemon itself knows that filenames are UTF-8, which it commonly won't). There are a number of cases like this where the Python standard libraries have been brutally hacked to Unicode strings with negative consequences; Python 3's batteries-included are still leaking corrosive fluid IMO.
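The effect of the ISO-8859-1 default can be sketched without a server. The snippet below assumes the remote side sends UTF-8 bytes for a sample name; recover_utf8 is a hypothetical helper, not part of ftplib. Because ISO-8859-1 maps every byte to exactly one character, the wrong decode is lossless and can be reversed. (Python 3.9 and later changed the ftplib default to UTF-8 and added an encoding argument to FTP(), so this mainly applies to earlier 3.x releases.)

```python
# Bytes a UTF-8 server would send for the sample name "résumé.txt".
wire_bytes = "r\u00e9sum\u00e9.txt".encode("utf-8")

# A latin-1 decode turns every byte into one character, so each
# two-byte UTF-8 sequence shows up as two unrelated characters.
as_latin1 = wire_bytes.decode("latin-1")


def recover_utf8(name):
    """Hypothetical helper: undo a latin-1 decode of bytes that were really UTF-8."""
    return name.encode("latin-1").decode("utf-8")


fixed = recover_utf8(as_latin1)
```

The same round trip works in the other direction: a name encoded to UTF-8 and then decoded as latin-1 yields a string that a latin-1 ftplib will send as the original UTF-8 bytes.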

bobince
@Bobince: Thank you! Regards, Malcolm
Malcolm