I'm writing some code to parse forwarded emails. What I'm not sure about is whether there is a Python library, an RFC I could stick to, or some other resource that would let me automate the task.

To be precise, I don't know whether the "layout" of a forwarded email is covered by some standard or recommendation, or whether it has just evolved over the years so that most email clients now produce similar output for the text part:

    Begin forwarded message: 

    > From: Me <[email protected]>
    > Date: January 30, 2010 18:26:33 PM GMT+02:00
    > To: Other Me <[email protected]>
    > Subject: Unwise question

-- and go wild for attachments and whatever other MIME sections can be there.

If it's still not precise enough I'll clarify it, it's just that I'm not 100% sure what to ask about (RFC, Python lib, convention or something else).

+1  A: 

The convention for a reply/forward is to prepend > to each line, once for each level of nesting. Whether and how to include who sent the initial e-mail is up to the client to sort out. So what you need to do in Python is simply add > to the start of each line.

    imap Test <[email protected]> Wrote:
    >
    >twice
    >imap Test wrote:
    >> nested
    >>
    >> [email protected] wrote:
    >>> test
    >>>
    >>> -- 
    >>> Message sent via AHEM.
    >>>   
    >>
    >

Attachments just simply need to be attached to the message or as you put it 'go wild.'

I am not familiar with Python, but I believe the code would be:

    # Prefix the first line too; replace() alone only marks lines after a newline.
    quoted = ">" + text.replace("\n", "\n>")
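
Since the question is about parsing rather than producing quoted text, here is a minimal sketch of the reverse direction: counting the leading > markers on each line to recover the nesting depth. The function name `split_quote_depth` and the sample text are made up for illustration, and it assumes the client used plain > markers (optionally followed by a space), which not every client does.

```python
import re

# Leading run of '>' markers, each optionally followed by one space or tab.
# Assumption: the client quotes with '>' only (no '|' or indentation styles).
_QUOTE_PREFIX = re.compile(r"^((?:>[ \t]?)*)")


def split_quote_depth(line):
    """Return (depth, text), where depth is the number of leading '>' markers."""
    prefix = _QUOTE_PREFIX.match(line).group(1)
    return prefix.count(">"), line[len(prefix):]


sample = "imap Test wrote:\n>\n>twice\n>> nested\n>>> test"
parsed = [split_quote_depth(line) for line in sample.splitlines()]
```

Lines with depth 0 are the newest text; higher depths belong to older messages in the chain. Lines that a client re-wrapped without a marker will be misattributed, so treat the depth as a hint only.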
Gazler
Thanks, but it seems I left out the word `parse` in my question :) It's fixed now.
Tomasz Zielinski
Ahh, well that totally changes the question. :) The answer still stands though, there is no standard, it is entirely up to the client.
Gazler
@Gazler: Look out in the real world and see if there are other programs that make the correct inferences despite the lack of a standard. For example, if you subscribe to tripit.com and forward your airline itinerary to [email protected], your account will display the correct travel details. Tripit.com, at least, can do the right thing. How can the OP approximate this behavior?
hughdbrown
tripit.com I assume uses a lot of sweat and elbow grease. There's a few hundred travel providers, times a few dozen email programs. If you have the people then write the parsers for each one. Make sure it's a stringent parser, and when something comes in which isn't formatted right, write a new parser to handle that change. There's probably 6 months or more of work to get right.
Andrew Dalke
+1  A: 

In my experience just about every email client forwards/replies differently. Typically you'll have a plain-text version and an HTML version among the MIME parts of the message. Mail headers do have an RFC (RFC 2822, http://www.faqs.org/rfcs/rfc2822.html), but unfortunately the content of the message body is outside its scope.
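
As a rough sketch of picking between the plain-text and HTML versions, Python's standard `email` package can do the MIME walking. The helper name `extract_text_body` and the demo message are hypothetical; the sketch assumes the message parses cleanly and prefers the plain-text part when both exist.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def extract_text_body(raw_bytes):
    """Return (content_type, text) for the best body part, or (None, None)."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    # Prefer text/plain, fall back to text/html.
    body = msg.get_body(preferencelist=("plain", "html"))
    if body is None:
        return None, None
    return body.get_content_type(), body.get_content()


# Demo message with both alternatives (made-up content).
demo = EmailMessage()
demo["Subject"] = "Fwd: Unwise question"
demo.set_content("plain text version\n")
demo.add_alternative("<p>HTML version</p>", subtype="html")

content_type, text = extract_text_body(demo.as_bytes())
```

Whatever text comes back still has to be parsed for the forwarded content; this only gets you past the MIME layer.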

Not only do you have to contend with the mail client variance, but the variance of user preferences. As an example: Lotus Notes puts replies at the top and Thunderbird replies at the bottom. So when a Thunderbird user is replying to a Lotus Notes user's reply they might insert their reply at the top and leave their signature at the bottom.

Another pitfall may be contending with the word wrapping of reply chains:

    >>>> The outer reply that goes over the limit and is word wrapped by
    the middle replier's mail client\n
    >> The message body of a middle reply
    > Previous reply
    Newest reply

I wouldn't parse the message at all; I'd leave it to the users to parse in their heads. Or I'd borrow the code from another project.

ryan v
Thanks. Fortunately I can put some constraints on the incoming forwarded emails. Anyway, it's a pity that there is no "Best Practices Code" for this.
Tomasz Zielinski
The RFC for mail headers is now RFC 5322.
bortzmeyer
+1  A: 

As the other answers already indicate: there is no standard, and your program is not going to be flawless.

You could have a look at the headers, in particular the User-Agent header, to see what kind of client was used, and code specifically for the most common clients.
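
A minimal sketch of that header check, using the standard `email` package. The substring table `KNOWN_CLIENTS` and the family names are assumptions for illustration, not an authoritative list, and some clients announce themselves in `X-Mailer` rather than `User-Agent`, so the sketch looks at both.

```python
from email import message_from_string, policy

# Hypothetical mapping from header substrings to client families.
# Real header values vary by version; extend this from mail you actually see.
KNOWN_CLIENTS = {
    "thunderbird": "thunderbird",
    "apple mail": "apple_mail",
    "outlook": "outlook",
    "mutt": "mutt",
}


def guess_client(msg):
    """Guess the sending client from User-Agent or X-Mailer, else 'unknown'."""
    agent = msg.get("User-Agent") or msg.get("X-Mailer") or ""
    lowered = str(agent).lower()
    for needle, family in KNOWN_CLIENTS.items():
        if needle in lowered:
            return family
    return "unknown"


raw = (
    "From: a@example.com\n"
    "User-Agent: Mozilla/5.0 (X11; Linux) Thunderbird/3.0\n"
    "Subject: hi\n"
    "\n"
    "body\n"
)
client = guess_client(message_from_string(raw, policy=policy.default))
```

The result can then select a client-specific parser, with "unknown" falling through to a generic one.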

To find out which clients you should consider supporting, have a look at this popularity study. Various Outlooks, Yahoo!, Hotmail, Mail.app, iPhone mail, Gmail and Lotus Notes rank highly. About 11% of the mail is classified as "undetectable", but using headers from the forwarded e-mail you might be able to do better than that. Note that the statistics were gathered by placing an image inside the e-mail, so results may be skewed.

Another problem is HTML mail, which may or may not include a plain-text version. I'm not sure about clients' usual behaviour in this respect.

Thomas
As I've written in the other comment, I can put some constraints on users of my script (e.g. make them use only one supported email client), but the HTML part can still be tricky as the original incoming emails can contain pretty much anything.
Tomasz Zielinski
If the client wraps the original in a `<div>` or something, when it forwards it, then HTML might actually be the easy part.
Thomas
+2  A: 

Contrary to what many other people said, there is a standard for forwarded emails: RFC 2046, "Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types", which is more than ten years old. See especially its section 5.2, "Message Media Type".

The basic idea behind RFC 2046 is to encapsulate one message into the MIME part of another, of type named (unfortunately) message/rfc822 (never forget that MIME is recursive). The MIME library of Python can handle it fine.
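
For illustration, a small sketch of pulling encapsulated messages out with Python's standard `email` package. The helper name `forwarded_messages` and the demo messages are made up; it assumes the forwarding client really did attach the original as a `message/rfc822` part.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def forwarded_messages(msg):
    """Yield every message encapsulated as a message/rfc822 part."""
    # walk() also descends into encapsulated messages, so nested
    # forwards are found too.
    for part in msg.walk():
        if part.get_content_type() == "message/rfc822":
            # The payload of such a part is a one-item list holding
            # the inner message.
            yield part.get_payload(0)


# Build a demo forward (made-up addresses and content).
original = EmailMessage()
original["From"] = "Me <me@example.com>"
original["Subject"] = "Unwise question"
original.set_content("original body\n")

wrapper = EmailMessage()
wrapper["Subject"] = "Fwd: Unwise question"
wrapper.set_content("see attached\n")
wrapper.add_attachment(original)  # stored as message/rfc822

parsed = message_from_bytes(wrapper.as_bytes(), policy=policy.default)
inner = list(forwarded_messages(parsed))
```

Each yielded object is a full message, so its headers and body can be read the same way as the outer one.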

I did not downvote the other answers because they are right in one respect: the standard is not followed by every mailer. For instance, the mutt mailer can forward a message in RFC 2046 format but also in an ad hoc format. So, in practice, a parser probably cannot handle only RFC 2046; it also has to parse the various other, underspecified syntaxes.
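
For the ad hoc side, here is a sketch of a stringent parser for just one such layout, the "Begin forwarded message:" block quoted in the question. The function name `parse_inline_forward`, the marker default and the list of recognised fields are assumptions; other clients use different markers and field names, so each would need its own variant.

```python
import re

# Header-like lines, optionally prefixed with a single '>' quote marker.
_HEADER_LINE = re.compile(r"^>?\s*(From|Date|To|Cc|Subject):\s*(.*)$")


def parse_inline_forward(text, marker="Begin forwarded message:"):
    """Return a dict of the header fields after the marker, or None."""
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if line.strip() == marker:
            break
    else:
        return None  # marker not found: not this layout

    headers = {}
    for line in lines[index + 1:]:
        match = _HEADER_LINE.match(line.strip())
        if match:
            headers[match.group(1)] = match.group(2).strip()
        elif headers:
            break  # first non-header line after the block ends it
    return headers


sample = (
    "Begin forwarded message:\n"
    "\n"
    "> From: Me <me@example.com>\n"
    "> Date: January 30, 2010 18:26:33 PM GMT+02:00\n"
    "> To: Other Me <other@example.com>\n"
    "> Subject: Unwise question\n"
    "\n"
    "body text\n"
)
fields = parse_inline_forward(sample)
```

Returning `None` when the marker is missing makes it easy to try the `message/rfc822` route first and fall back through a list of layout-specific parsers.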

bortzmeyer