I have a small Python script that pulls emails from a POP mail account and dumps them into files (one file per email).

Then a PHP script runs through the files and displays them.

I am having an issue with ISO-8859-1 (Latin-1) encoded emails.

Here's an example of the text I get: =?iso-8859-1?Q?G=EDsli_Karlsson?= and Sj=E1um hva=F0 =F3li er kl=E1r J

This is the code I use to pull the emails:

import poplib, random, rfc822, StringIO

pop = poplib.POP3(server)

mail_list = pop.list()[1]

for m in mail_list:
    mno, size = m.split()
    lines = pop.retr(mno)[1]

    file = StringIO.StringIO("\r\n".join(lines))
    msg = rfc822.Message(file)

    body = file.readlines()

    f = open(str(random.randint(1,100)) + ".email", "w")
    f.write(msg["From"] + "\n")
    f.write(msg["Subject"] + "\n")
    f.write(msg["Date"] + "\n")

    for b in body:
        f.write(b)

I have tried probably every combination of encode / decode within Python and PHP.

A: 

That's MIME content, and that's what the email actually looks like; it is not a bug somewhere. You have to use a MIME decoding library (or decode it yourself manually) on the PHP side of things (which, if I understood correctly, is the one acting as the email renderer).

In Python you'd use mimetools (or the email package, which supersedes it). In PHP, I'm not sure. It seems the Zend Framework has a MIME parser somewhere, and there are probably zillions of snippets floating around.

http://en.wikipedia.org/wiki/MIME#Encoded-Word

Vinko Vrsalovic
+2  A: 

Until very recently, plain Latin-N or UTF-N was not allowed in headers, so non-ASCII text has to be encoded with a method first described in RFC 1522, which has since been superseded. Accented characters are encoded either in quoted-printable or in Base64, and the encoding is indicated by ?Q? (or ?B? for Base64). You'll have to decode them. Oh, and a space is encoded as "_". See Wikipedia.
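To make the format concrete, here is a minimal sketch of decoding such an encoded-word by hand. It assumes Python 3 (the scripts in this thread are Python 2), and the names ENCODED_WORD, decode_encoded_word and decode_header_value are made up for this illustration. In real code the standard email package is the safer choice.

```python
import base64
import quopri
import re

# One RFC 2047 encoded-word looks like: =?charset?encoding?encoded-text?=
ENCODED_WORD = re.compile(r"=\?([^?]+)\?([QqBb])\?([^?]*)\?=")


def decode_encoded_word(match):
    """Turn a single matched encoded-word into a unicode string."""
    charset, encoding, text = match.groups()
    if encoding.upper() == "Q":
        # header=True makes quopri treat "_" as an encoded space.
        raw = quopri.decodestring(text.encode("ascii"), header=True)
    else:
        raw = base64.b64decode(text)
    return raw.decode(charset)


def decode_header_value(value):
    """Replace every encoded-word found in a header value with its decoded text."""
    return ENCODED_WORD.sub(decode_encoded_word, value)


decoded = decode_header_value("=?iso-8859-1?Q?G=EDsli_Karlsson?=")
print(decoded)
```

This sketch leaves out one rule of RFC 2047: whitespace between two adjacent encoded-words is supposed to be dropped. The library decoders take care of that.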

Keltia
Well, plain UTF-8 (*not* Latin-1) was authorized in RFC 5335 "Internationalized Email Headers". But it has the status Experimental and is not widely deployed. The current standard is RFC 2047 "MIME (Multipurpose Internet Mail Extensions) Part Three: Message Header Extensions for Non-ASCII Text".
bortzmeyer
Thanks, I was sure you were going to chime in with the references I needed :)
Keltia
+2  A: 

You can use the Python email library (Python 2.5+) to avoid these problems:

import email
import poplib
import random
from cStringIO import StringIO
from email.generator import Generator

pop = poplib.POP3(server)

mail_count = len(pop.list()[1])

# POP3 message numbers start at 1, not 0
for message_num in xrange(1, mail_count + 1):
    message = "\r\n".join(pop.retr(message_num)[1])
    message = email.message_from_string(message)

    out_file = StringIO()
    message_gen = Generator(out_file, mangle_from_=False, maxheaderlen=60)
    message_gen.flatten(message)
    message_text = out_file.getvalue()

    filename = "%s.email" % random.randint(1,100)
    email_file = open(filename, "w")
    email_file.write(message_text)
    email_file.close()

This code gets all the messages from your server, turns them into Python message objects, then flattens them back into strings for writing to the file. Because it uses the email package from the Python standard library, MIME encoding and decoding issues should be handled for you.
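If the goal is readable text rather than a re-flattened message, the same package can also decode the headers and the body. The following is only a sketch, assuming Python 3. RAW_MESSAGE is a hypothetical message assembled from the sample text in the question, and the address in it is invented.

```python
import email
from email.header import decode_header, make_header

# Hypothetical raw message built from the sample text in the question.
RAW_MESSAGE = (
    "From: =?iso-8859-1?Q?G=EDsli_Karlsson?= <gisli@example.com>\r\n"
    "Subject: test\r\n"
    "MIME-Version: 1.0\r\n"
    "Content-Type: text/plain; charset=iso-8859-1\r\n"
    "Content-Transfer-Encoding: quoted-printable\r\n"
    "\r\n"
    "Sj=E1um hva=F0 =F3li er kl=E1r\r\n"
)

message = email.message_from_string(RAW_MESSAGE)

# Headers: decode_header splits the value into (bytes, charset) chunks,
# and make_header joins them back into a single unicode string.
sender = str(make_header(decode_header(message["From"])))

# Body: decode=True undoes the quoted-printable transfer encoding and
# returns bytes, which still need decoding with the declared charset.
charset = message.get_content_charset() or "ascii"
body = message.get_payload(decode=True).decode(charset, errors="replace")

print(sender)
print(body)
```

Writing sender and body to the file as UTF-8 would leave the PHP side with nothing to decode, provided the page is served as UTF-8 too.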

DISCLAIMER: I have not tested that code, but it should work just fine.

mcrute
I used a combination of your reply and other things I found. I have added my code as an answer.
Ólafur Waage
A: 

That's the MIME encoding of headers, RFC 2047. Here is how to decode it in Python:

import email.Header
import sys

header_and_encoding = email.Header.decode_header(sys.stdin.readline())
for part in header_and_encoding:
    if part[1] is None:
        print part[0],
    else:
        upart = (part[0]).decode(part[1])
        print upart.encode('latin-1'),
print

More detailed explanations (in French) in http://www.bortzmeyer.org/decoder-en-tetes-courrier.html

bortzmeyer
+1  A: 

There is probably a better way to do this, but this is what I ended up with. Thanks for your help, guys.

import poplib, quopri
import random
import sys, StringIO
import email
from email.Generator import Generator

user = "[email protected]"
password = "password"
server = "mail.example.com"

# connects
try:
 pop = poplib.POP3(server)
except:
 print "Error connecting to server"
 sys.exit(-1)

# user auth
try:
 print pop.user(user)
 print pop.pass_(password)
except:
 print "Authentication error"
 sys.exit(-2)

# gets the mail list
mail_list = pop.list()[1]

for m in mail_list:
 mno, size = m.split()
 message = "\r\n".join(pop.retr(mno)[1])
 message = email.message_from_string(message)

 # uses the email flatten
 out_file = StringIO.StringIO()
 message_gen = Generator(out_file, mangle_from_=False, maxheaderlen=60)
 message_gen.flatten(message)
 message_text = out_file.getvalue()

 # fixes mime encoding issues (for display within html)
 clean_text = quopri.decodestring(message_text)

 msg = email.message_from_string(clean_text)

 # finds the last body (when in mime multipart, html is the last one)
 for part in msg.walk():
  # skip multipart containers, which have no payload of their own
  if not part.is_multipart():
   body = part.get_payload(decode=True)

 filename = "%s.email" % random.randint(1,100)

 email_file = open(filename, "w")

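 # a missing header comes back as None, so fall back to an empty line
 email_file.write((msg["From"] or "") + "\n")
 email_file.write((msg["Return-Path"] or "") + "\n")
 email_file.write((msg["Subject"] or "") + "\n")
 email_file.write((msg["Date"] or "") + "\n")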
 email_file.write(body)

 email_file.close()

pop.quit()
sys.exit()
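
The "last body wins" loop above depends on the HTML part coming last. A more explicit way to choose the part might look like the sketch below. It assumes Python 3, and pick_body and the demo message are hypothetical names used only for illustration. It also ignores attachments.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def pick_body(message, preferred=("text/html", "text/plain")):
    """Return the decoded text of the first part matching a preferred type, or None."""
    for wanted in preferred:
        for part in message.walk():
            # Skip containers: only leaf parts carry a payload worth decoding.
            if part.is_multipart() or part.get_content_type() != wanted:
                continue
            raw = part.get_payload(decode=True)
            charset = part.get_content_charset() or "ascii"
            return raw.decode(charset, errors="replace")
    return None


# Hypothetical two-part message, plain text first and HTML last.
demo = MIMEMultipart("alternative")
demo.attach(MIMEText("Sjáum hvað óli er klár", "plain", "iso-8859-1"))
demo.attach(MIMEText("<p>Sjáum hvað óli er klár</p>", "html", "iso-8859-1"))

html_body = pick_body(demo)
text_body = pick_body(demo, preferred=("text/plain",))
print(html_body)
print(text_body)
```

Decoding each part with its own charset also avoids running quopri over the whole flattened message, which could mangle parts that use a different transfer encoding.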
Ólafur Waage