I want to import emails from mbox files into a Django app. All database tables are Unicode. My problem: sometimes a message declares the wrong charset, and sometimes it declares none at all. What is the best way to deal with these encoding issues?

So far I merely nest exceptions, trying the declared charset first and then the two most common charsets I receive mail in (utf-8 and iso-8859-1):

    if not message.is_multipart():
        message_charset = message.get_content_charset()
        msg.message = message_charset + unicode(message.get_payload(decode=False), message_charset)
    else:
        for part in message.walk():
            if part.get_content_type() == "text/plain":
                message_charset = part.get_content_charset()
                try:
                    # First try the charset the message declares...
                    msg.message = message_charset + unicode(part.get_payload(decode=False), message_charset)
                except UnicodeDecodeError:
                    try:
                        # ...then fall back to the two most common charsets.
                        msg.message = message_charset + unicode(part.get_payload(decode=False), "utf-8")
                    except UnicodeDecodeError:
                        msg.message = message_charset + unicode(part.get_payload(decode=False), "iso-8859-1")

Is there a better, more robust way?

Thanks!

+1  A: 

You could ask the excellent chardet library to guess the encoding.

"Character encoding auto-detection in Python 2 and 3. As smart as your browser. Open source."

RichieHindle
Thanks, Richie, for the hint. I guess I will keep testing for the 2-3 most common encodings when importing; if those fail, I will set a flag on the affected mails and offer the option to feed them to chardet in the user interface.
Gregor
A: 

I'm sorry, but your strategy is WRONG.

Firstly, there are encodings that were deliberately designed to fly under the 7-bit ASCII radar so that they could be used in early email systems. The Chinese HZ encoding is little used these days, but Japanese email seems to use ISO-2022-JP quite frequently. Both of those would be wrongly interpreted as ASCII if you tried that first, and your current strategy wrongly interprets them as UTF-8: their bytes are all 7-bit, so they pass a UTF-8 decode without raising an error. It would likewise accept restricted (all characters < U+0080) UTF-16 text as UTF-8.
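A quick illustration of the first point (Python 2, to match the question; the sample text is arbitrary):

    # ISO-2022-JP text is pure 7-bit, so a UTF-8 decode "succeeds" silently
    # and yields escape sequences instead of the original kanji.
    original = u'\u65e5\u672c\u8a9e'        # "Japanese", written in Japanese
    wire = original.encode('iso-2022-jp')   # every byte is < 0x80
    garbled = wire.decode('utf-8')          # no UnicodeDecodeError is raised
    print garbled == original               # False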

Secondly, ISO-8859-1 maps every one of the 256 possible byte values to a Unicode character, so random_garbage.decode('iso-8859-1') will never raise an exception. In other words, anything that fails the UTF-8 test gets silently accepted as ISO-8859-1 by your strategy, correct or not.
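Concretely:

    import os

    noise = os.urandom(16)             # sixteen random bytes
    text = noise.decode('iso-8859-1')  # never raises: every byte maps to U+0000..U+00FF
    print len(text) == len(noise)      # True: one character per byte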

Do what the man said: use chardet right from the start. It knows in what order the tests should be done.
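A sketch of what that could look like in the question's setting (the helper name, the decode=True call to undo any transfer encoding, and the final iso-8859-1 fallback are illustrative assumptions, not part of this answer):

    import chardet

    def part_to_unicode(part):
        # Let chardet choose the charset instead of trial-decoding
        # in a fixed order.
        raw = part.get_payload(decode=True) or ''
        charset = (chardet.detect(raw)['encoding']
                   or part.get_content_charset()
                   or 'iso-8859-1')
        try:
            return unicode(raw, charset)
        except (UnicodeDecodeError, LookupError):
            # Last resort: ISO-8859-1 maps all 256 byte values, so it
            # cannot fail; the result may still be mojibake, though.
            return unicode(raw, 'iso-8859-1')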

John Machin