I have an App Engine app that receives incoming mail with attachments. I check each attachment's filename to make sure the extension is correct. If the filename contains umlauts or accented characters, it arrives in an encoded form that my code can't read, so I don't know how to check the file type.

For example, if I send a file named ZumBrückenwirtÜberGrünwaldZurück(2).gpx

And then print out the attachment name like this:

 attachments = [message.attachments]        
 attachmenttype = attachments[0][0][-4:].lower()  
 logging.error("attachment name %s, %s" % (attachments[0][0], attachmenttype))

I get:

attachment name =?ISO-8859-1?B?WnVtQnL8Y2tlbndpcnTcYmVyR3L8bndhbGRadXL8Y2soMikuZ3B4?=, b4?=

A:

That's an RFC 2047 encoded-word. You can decode it with the email package, although the decoded pieces still need stitching together afterwards:

import email.header

def parseHeader(h):
    # decode_header yields (byte string, charset) pairs; charset is None for plain ASCII runs
    return ''.join(s.decode(c or 'us-ascii') for s, c in email.header.decode_header(h))

>>> parseHeader('=?ISO-8859-1?B?WnVtQnL8Y2tlbndpcnTcYmVyR3L8bndhbGRadXL8Y2soMikuZ3B4?=')
u'ZumBr\xfcckenwirt\xdcberGr\xfcnwaldZur\xfcck(2).gpx'

It is, however, utterly incorrect to use an encoded-word in the filename="..." parameter for Content-Disposition in an attachment. RFC 2047 explicitly states that an encoded-word cannot appear in a quoted parameter. Non-ASCII parameter values are supposed to be transferred using the rules of RFC 2231, which look completely different (and are very complicated).
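
For comparison, here is a rough sketch of what an RFC 2231 filename looks like and how Python's email package deals with it. The sample header is made up for illustration (it is not taken from the question's message), and I'm assuming `get_filename()` does the RFC 2231 collapsing for you, which is how the standard library behaves as far as I know:

```python
import email

# Hypothetical header in the RFC 2231 form: charset'language'percent-encoded-value
raw = (
    "Content-Type: application/octet-stream\n"
    "Content-Disposition: attachment; filename*=ISO-8859-1''Zur%FCck.gpx\n"
    "\n"
    "body\n"
)

msg = email.message_from_string(raw)
# get_filename() reads the Content-Disposition filename parameter
# and collapses the RFC 2231 value into a unicode string
filename = msg.get_filename()
```

Note the `filename*=` parameter name and the percent-encoding; there is no `=?...?=` wrapper anywhere.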

So according to the mail standard, you should treat this filename as literally being “=?ISO-8859-1?B?WnVtQnL8Y2tlbndpc...”. I believe it's MS Exchange that generates this nonsense. Try to keep this processing down to a minimum (e.g. by only decoding when the string is wrapped in =?...?=, which is pretty unlikely for a real filename).
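
A minimal sketch of that "only decode when wrapped" idea, in case it helps. The names `ENCODED_WORD` and `safe_filename` are my own, and the pattern only handles a filename that is exactly one encoded-word; treat it as a starting point rather than a complete parser:

```python
import email.header
import re

# Hypothetical pattern: the whole string is a single =?charset?B-or-Q?data?= word
ENCODED_WORD = re.compile(r'^=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=$')

def safe_filename(raw):
    """Decode raw only if it looks like one RFC 2047 encoded-word."""
    if not ENCODED_WORD.match(raw):
        # Ordinary filename: leave it untouched
        return raw
    parts = []
    for data, charset in email.header.decode_header(raw):
        # Decoded chunks come back as byte strings; turn them into text
        if isinstance(data, bytes):
            data = data.decode(charset or 'us-ascii', 'replace')
        parts.append(data)
    return u''.join(parts)

name = safe_filename(
    '=?ISO-8859-1?B?WnVtQnL8Y2tlbndpcnTcYmVyR3L8bndhbGRadXL8Y2soMikuZ3B4?=')
is_gpx = name.lower().endswith('.gpx')
```

With the filename decoded first, the extension check from the question works on the real name instead of on the tail of the base64 text.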

bobince