I want to import emails from an mbox file into a Django app. All database tables are Unicode. My problem: the declared charset is sometimes wrong and sometimes missing altogether. What is the best way to deal with these encoding issues?
So far, I just nest try/except blocks to fall back on the two most common charsets I receive mail in (utf-8 and iso-8859-1):
    if not message.is_multipart():
        message_charset = message.get_content_charset()
        # decode=True undoes base64/quoted-printable; a missing charset falls back to utf-8
        msg.message = unicode(message.get_payload(decode=True), message_charset or "utf-8")
    else:
        for part in message.walk():
            if part.get_content_type() == "text/plain":
                message_charset = part.get_content_charset()
                payload = part.get_payload(decode=True)
                try:
                    # LookupError covers charset names Python does not know
                    msg.message = unicode(payload, message_charset or "utf-8")
                except (UnicodeDecodeError, LookupError):
                    try:
                        msg.message = unicode(payload, "utf-8")
                    except UnicodeDecodeError:
                        # iso-8859-1 maps every byte, so this last step cannot fail
                        msg.message = unicode(payload, "iso-8859-1")
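For comparison, the same fallback chain could probably be flattened into a loop over candidate charsets. This is only a sketch, assuming the standard library `email` package; `decode_part` and `FALLBACK_CHARSETS` are names made up for illustration, and the fallback order is just the two charsets mentioned above:

```python
import email

# Hypothetical fallback order; iso-8859-1 goes last because it accepts any byte sequence.
FALLBACK_CHARSETS = ("utf-8", "iso-8859-1")


def decode_part(part, fallbacks=FALLBACK_CHARSETS):
    """Return the text of a message part, trying the declared charset first."""
    # decode=True undoes base64 / quoted-printable and yields raw bytes.
    payload = part.get_payload(decode=True)
    if payload is None:  # multipart containers carry no payload of their own
        return u""
    declared = part.get_content_charset()  # None when no charset is declared
    candidates = ([declared] if declared else []) + list(fallbacks)
    for charset in candidates:
        try:
            return payload.decode(charset)
        except (UnicodeDecodeError, LookupError):
            # Bytes do not fit the charset, or the charset name is unknown: try the next one.
            continue
    # Only reached if the fallback list lacks a catch-all such as iso-8859-1.
    return payload.decode("utf-8", "replace")
```

With a helper like this, each branch above would reduce to something like `msg.message = decode_part(part)`. It still cannot tell when a wrong charset happens to decode without error, which is the part I am unsure how to handle.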
Is there a better, more robust way?
Thanks!