I want to import emails from mbox files into a Django app. All database tables are Unicode. My problem: sometimes a message declares the wrong charset, and sometimes it declares none at all. What is the best way to deal with these encoding issues?

So far I merely nest exceptions, trying the declared charset first and then the two most common charsets I receive mail in (utf-8 and iso-8859-1):

    if not message.is_multipart():
        message_charset = message.get_content_charset()
        msg.message = message_charset + unicode(message.get_payload(decode=False), message_charset)
    else:
        for part in message.walk():
            if part.get_content_type() == "text/plain":
                message_charset = part.get_content_charset()
                try:
                    # First try the charset the message declares...
                    msg.message = message_charset + unicode(part.get_payload(decode=False), message_charset)
                except UnicodeDecodeError:
                    try:
                        # ...then fall back to the two most common charsets.
                        msg.message = message_charset + unicode(part.get_payload(decode=False), "utf-8")
                    except UnicodeDecodeError:
                        msg.message = message_charset + unicode(part.get_payload(decode=False), "iso-8859-1")

Is there a better, more robust way?

Thanks!

+1  A: 

You could ask the excellent chardet library to guess the encoding.

"Character encoding auto-detection in Python 2 and 3. As smart as your browser. Open source."

RichieHindle
Thanks, Richie, for the hint. I guess I will keep testing for the 2-3 most common encodings when importing; if those fail, I will set a flag on the affected mails and offer the option to feed them to chardet in the user interface.
Gregor
A: 

I'm sorry, but your strategy is WRONG.

Firstly, there are encodings that were deliberately designed to fly under the 7-bit ASCII radar so that they could be used in early email systems. The Chinese HZ encoding is little used these days, but Japanese email seems to use ISO-2022-JP quite frequently. Both of those would be wrongly interpreted as ASCII if you tried that first, and your current strategy wrongly interprets them as UTF-8: their bytes are all 7-bit, so they pass a UTF-8 decode without raising an error. It would likewise accept restricted (all characters < U+0080) UTF-16 text as UTF-8.
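A quick illustration of the first point (Python 2, to match the question; the sample text is arbitrary):

    # ISO-2022-JP text is pure 7-bit, so a UTF-8 decode "succeeds" silently
    # and yields escape sequences instead of the original kanji.
    original = u'\u65e5\u672c\u8a9e'        # "Japanese", written in Japanese
    wire = original.encode('iso-2022-jp')   # every byte is < 0x80
    garbled = wire.decode('utf-8')          # no UnicodeDecodeError is raised
    print garbled == original               # False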

Secondly, ISO-8859-1 maps every one of the 256 possible byte values to a Unicode character, so random_garbage.decode('iso-8859-1') will never raise an exception. In other words, anything that fails the UTF-8 test gets silently accepted as ISO-8859-1 by your strategy, correct or not.
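Concretely:

    import os

    noise = os.urandom(16)             # sixteen random bytes
    text = noise.decode('iso-8859-1')  # never raises: every byte maps to U+0000..U+00FF
    print len(text) == len(noise)      # True: one character per byte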

Do what the man said: use chardet right from the start. It knows in what order the tests should be done.
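A sketch of what that could look like in the question's setting (the helper name, the decode=True call to undo any transfer encoding, and the final iso-8859-1 fallback are illustrative assumptions, not part of this answer):

    import chardet

    def part_to_unicode(part):
        # Let chardet choose the charset instead of trial-decoding
        # in a fixed order.
        raw = part.get_payload(decode=True) or ''
        charset = (chardet.detect(raw)['encoding']
                   or part.get_content_charset()
                   or 'iso-8859-1')
        try:
            return unicode(raw, charset)
        except (UnicodeDecodeError, LookupError):
            # Last resort: ISO-8859-1 maps all 256 byte values, so it
            # cannot fail; the result may still be mojibake, though.
            return unicode(raw, 'iso-8859-1')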

John Machin