Hi,

I'm trying to write an application that periodically receives e-mails and writes every mail into a database. But sometimes I get a 'Re:' e-mail that looks something like this:

New message

On September 21, 2010 24:26 Someone wrote (a):

| Old message |

The format depends on the e-mail provider.

Is there any library that helps remove the 'Re' part from an e-mail message? Maybe the IMAP server can do that? I have all the previous e-mails from the thread in the database, so I can take them and search for them in the new message.

A: 
  1. No, an IMAP server will not and does not remove anything.
  2. Such a library does not exist because there is no standard: every e-mail provider does it differently, and Gmail and others have developed their own tools.
  3. You have to look for a pattern that begins with headers naming the recipient as the sender, like...
From: <recipient>
From: "NAME" <recipient>
From: recipient

and you have to omit everything from this line down. However, checking only this line is not sufficient: "From" is usually followed by Subject, Cc, To, etc., so the whole pattern needs to be checked. I think some open source project or text library may exist, but it is too difficult to find on Google.
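
To make point 3 concrete, here is a minimal Python sketch of that check. It is only an illustration, not a known library: the function name `strip_quoted_header_block`, the `own_address` and `lookahead` parameters, and the list of header names are all my assumptions, and real clients will produce blocks this does not catch.

```python
import re
from typing import Optional

# A "From:" line with something after it.
FROM_RE = re.compile(r"^From:\s*\S", re.IGNORECASE)
# Other headers that usually accompany it (assumed, incomplete list).
OTHER_HEADER_RE = re.compile(r"^(To|Cc|Subject|Sent|Date):", re.IGNORECASE)


def strip_quoted_header_block(
    body: str, own_address: Optional[str] = None, lookahead: int = 4
) -> str:
    """Cut the body at the first 'From:' line that is followed by other headers."""
    lines = body.splitlines()
    for i, line in enumerate(lines):
        text = line.strip()
        if not FROM_RE.match(text):
            continue
        # Optionally require that the quoted sender is our own address.
        if own_address and own_address.lower() not in text.lower():
            continue
        following = lines[i + 1 : i + 1 + lookahead]
        if any(OTHER_HEADER_RE.match(f.strip()) for f in following):
            return "\n".join(lines[:i]).rstrip()
    return body
```

Requiring a second header within a few lines is what keeps an ordinary sentence that happens to start with "From:" from truncating the message.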

Akash Kava
+1  A: 

Personally I think that you are out of luck here, as the message copy is part of the body. So in order to remove it you will have to process the message's body and write an extraction method for each known format (obviously the problem is that you cannot know all possible formats).

So, instead of parsing the body, why don't you persist the whole message in the database? Normally the size of the message should not be a problem with a modern DBMS. If it really is a problem, you can always compress the body and store it in a BLOB.

Obalix
I disagree. Size is not the constraint most of the time, but we need to display only the new message in the view, not the quoted replies.
Akash Kava
I agree with you that the copied text is just clutter. However, one will have to make a tradeoff: (1) develop a filter that will only ever catch part of the clutter and risks removing relevant content as well, which, because of those risks, will most likely prove costly; or (2) live with the clutter and deliver the project with a much lower risk. But as I said, it is a tradeoff!
Obalix
A: 

If you are able to associate a reply (RE:) message with the original/previous message that it is a reply to, then I would think that you could grab the body text of the original/previous message from your database, and then remove that text from the body of the reply. However, this method will not be 100% accurate, because clients could convert an HTML/Rich Text email into plain text, or vice versa. In any such case, this method probably wouldn't work. Even so, this technique would be generic and would probably work the majority of the time.
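
A rough Python sketch of that idea, comparing line by line after normalizing quote markers and whitespace. The names `remove_known_text` and `_normalize` are made up for illustration, and the normalization rules (stripping '>' and '|' markers, ignoring case) are assumptions about what clients typically add. Note that it will also drop a new line that happens to be identical to an old one.

```python
import re


def _normalize(line: str) -> str:
    # Strip quote markers ('>' or '|') and whitespace at both ends,
    # then collapse internal whitespace and ignore case.
    line = re.sub(r"^[\s>|]+|[\s|]+$", "", line)
    return re.sub(r"\s+", " ", line).lower()


def remove_known_text(reply_body: str, previous_body: str) -> str:
    """Drop every line of the reply that also occurs in the previous message."""
    known = {_normalize(line) for line in previous_body.splitlines()}
    known.discard("")  # blank lines carry no information, keep them
    kept = [
        line for line in reply_body.splitlines() if _normalize(line) not in known
    ]
    return "\n".join(kept).strip()
```

With the sample from the question, this removes "| Old message |" but leaves the "On ... wrote" line in place, since that line never appeared in the previous message.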

In addition, the email provider may add certain header fields, or preambles, to the beginnings of a quoted message in a reply. In this case, I don't think there is any "catch all" solution.

My recommendation would be to target a few of the really huge web-mail providers (Gmail, Yahoo, Microsoft, etc), learn the formats that they use for their replies and parse the messages accordingly. In addition, you could likely handle a few generic formats as well. For instance, the '>' character is commonly used at the beginning of each line of quoted text in a reply.

If you're going to be developing in a language like C#, create yourself an Interface like IReplyFormat, with a corresponding implementation for each provider, and possibly some generic formats.
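
The same structure can be sketched in Python with a `Protocol` standing in for the interface. Everything named here (`ReplyFormat`, `AngleBracketFormat`, `AttributionLineFormat`, `strip_reply`) is hypothetical, and the two implementations cover only generic formats, namely '>'-prefixed lines and an "On ... wrote ...:" line like the one in the question. I have not verified what any particular provider actually emits.

```python
import re
from typing import Iterable, Protocol


class ReplyFormat(Protocol):
    """Hypothetical counterpart of the suggested IReplyFormat interface."""

    def strip_quote(self, body: str) -> str:
        """Return the body with this format's quoted part removed."""
        ...


class AngleBracketFormat:
    """Generic format: quoted lines start with '>'."""

    def strip_quote(self, body: str) -> str:
        kept = [l for l in body.splitlines() if not l.lstrip().startswith(">")]
        return "\n".join(kept).strip()


class AttributionLineFormat:
    """Generic format: everything from an 'On ... wrote ...:' line on is quoted."""

    _attribution = re.compile(r"^On .+ wrote\b.*:\s*$", re.MULTILINE)

    def strip_quote(self, body: str) -> str:
        match = self._attribution.search(body)
        return body[: match.start()].strip() if match else body


def strip_reply(body: str, formats: Iterable[ReplyFormat]) -> str:
    """Apply each known format in turn."""
    for fmt in formats:
        body = fmt.strip_quote(body)
    return body
```

A provider-specific format would be one more class with a `strip_quote` method, added to the list passed to `strip_reply`. The attribution pattern is deliberately loose, so a genuine sentence such as "On Monday I wrote the report:" on a line of its own would also be treated as the start of a quote.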

I don't think you will find any catch-all/perfect solution to this problem, as there are simply too many mail providers with too many different formats. However, I think you can at the very least find some techniques, like the ones mentioned above, that will work for you more times than not, which is the best you can hope for at this point.

Justin Holzer
A: 

I agree with Obalix. It's too hard to filter out replies reliably, so you must keep the whole message. However, when you present the e-mail to the user, you can hide some parts of it. Those parts can be revealed with an optional "Click here to see the full message" link or similar. For instance, a regular expression to match the leading '>' characters would look something like @"^[ \f\t\v>]*"
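
As a sketch of how the display side could work (in Python rather than C#, and with a hypothetical function name `split_for_display`): the stored message stays untouched and is only split into a visible part and a collapsed part when it is rendered. The pattern below is a variant of the one above that requires at least one '>' to be present, since the original also matches lines with no '>' at all.

```python
import re
from typing import Tuple

# Optional leading whitespace followed by at least one '>'.
QUOTED_LINE = re.compile(r"^[ \f\t\v]*>")


def split_for_display(body: str) -> Tuple[str, str]:
    """Return (visible, hidden); hidden starts at the first '>'-quoted line."""
    lines = body.splitlines()
    for i, line in enumerate(lines):
        if QUOTED_LINE.match(line):
            return "\n".join(lines[:i]).rstrip(), "\n".join(lines[i:])
    return body, ""
```

The hidden part is what would sit behind the "Click here to see the full message" link. Replies written inline between quoted lines would end up hidden too, so this is only a starting point.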

SlavaGu