Hi all,

I'm working on a web application that parses and displays email messages in a threaded format (among other things). Emails may come from any number of different mail clients, and in either text or HTML format.

Given that most people have a tendency to top post, I'd like to be able to hide the duplicated message in an email reply in a manner similar to how Gmail does it (e.g. "show quoted text").

Determining which part of the message is the reply is somewhat challenging. Personally, I use "> " delimiters at the beginning of the quoted text when replying. I created a regexp that looks for these lines and wraps a div around them to allow some JS to hide or show this block of text.
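
A minimal sketch of the regexp-and-div approach described above, in Python. The class name `quoted-text` and the function name are hypothetical; the actual markup would depend on whatever the hide/show JS expects.

```python
import html
import re

# Hypothetical class name; the JS toggle would key off it.
QUOTE_CLASS = "quoted-text"

# A quoted line is one whose first non-blank character is '>'.
QUOTED_LINE = re.compile(r"^\s*>")


def wrap_quoted_blocks(body: str) -> str:
    """Wrap each run of consecutive '>'-prefixed lines in a <div>."""
    out = []
    in_quote = False
    for line in body.splitlines():
        is_quoted = bool(QUOTED_LINE.match(line))
        if is_quoted and not in_quote:
            out.append(f'<div class="{QUOTE_CLASS}">')
        elif not is_quoted and in_quote:
            out.append("</div>")
        in_quote = is_quoted
        # Escape the text so it is safe to embed in the page.
        out.append(html.escape(line))
    if in_quote:
        out.append("</div>")
    return "\n".join(out)


print(wrap_quoted_blocks("Thanks!\n> old line 1\n> old line 2\nBye"))
```

This only handles the "> " convention; clients that quote differently pass through unchanged.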

I then noticed that Outlook doesn't use the "> " characters by default; it simply adds a header block above the quoted message with a summary of the headers (From, Subject, Date, etc.). The quoted text itself is left untouched. I can match on this block and hide the rest of the email, working on the assumption that the reply is top-posted.
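
A rough sketch of matching on such a header block. The marker shapes here (an "Original Message" separator, or a run of at least three header-like lines starting with "From:") are assumptions for illustration, not a verified description of what Outlook emits; real clients and locales vary.

```python
import re

# Assumed marker shapes; adjust to what the mail actually contains.
ORIGINAL_MESSAGE = re.compile(r"^-{2,}\s*Original Message\s*-{2,}$", re.IGNORECASE)
HEADER_LINE = re.compile(r"^(From|Sent|Date|To|Cc|Subject):", re.IGNORECASE)


def split_at_header_block(body: str, min_headers: int = 3):
    """Return (reply, quoted), splitting at the first header-like block."""
    lines = body.splitlines()
    for i, line in enumerate(lines):
        if ORIGINAL_MESSAGE.match(line.strip()):
            return "\n".join(lines[:i]), "\n".join(lines[i:])
        if line.lower().startswith("from:"):
            # Count the run of consecutive header-like lines starting here.
            run = 0
            for candidate in lines[i:]:
                if not HEADER_LINE.match(candidate):
                    break
                run += 1
            if run >= min_headers:
                return "\n".join(lines[:i]), "\n".join(lines[i:])
    # No header block found: treat the whole body as the reply.
    return body, ""


body = "Sounds good.\n\nFrom: Alice\nSent: Monday\nTo: Bob\nSubject: Plan\n\nOriginal text"
reply, quoted = split_at_header_block(body)
```

Requiring several header lines in a row is a guard against a reply that merely happens to contain a line starting with "From:".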

I then looked at Thunderbird, and it uses "> " for text, and <blockquote> for HTML mails. I still haven't looked at what Apple Mail does, what Notes does, or what any of the other millions of mail clients out there do.

Will I be writing a special-case regexp for every single client out there, or is there something I'm missing?

Any suggestions, sample code, or pointers to third-party libraries would be much appreciated!

A: 

The first thing I'd do is strip the special characters from both blocks and either remove all the whitespace or collapse it to a single space between words, then look for the old block inside the new one.
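
One way that idea could look, as a sketch only. The function names are made up, and "special characters" is taken here to mean anything that is not a word character or whitespace.

```python
import re


def normalize(text: str) -> str:
    """Drop non-word, non-space characters and collapse whitespace runs."""
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())


def contains_earlier_message(new_body: str, old_body: str) -> bool:
    """True if the normalized old message appears inside the new one."""
    old = normalize(old_body)
    return bool(old) and old in normalize(new_body)


old = "Can we meet\nat 3pm?"
new = "Sure.\n\n> Can we meet\n> at 3pm?"
found = contains_earlier_message(new, old)
```

Because the "> " prefixes and line breaks are normalized away, the quoted copy still matches even though it was re-wrapped. A plain substring test can also match in the middle of a word, so this is a coarse check.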

RKitson
I don't think it's a diff problem, more of a pattern match. I updated the description to better describe what I'm trying to accomplish.
Darren
+1  A: 

From what I can tell, Gmail does not bother with prefixed lines or section headings, except to ignore them. If lines of text appeared earlier in the thread and then reappear, they are considered quoted. Thus, for example, if you send multiple messages without changing your signature, the signature is considered quoted. If you've already dealt with the '>' prefix, a simple diff should do most of the rest. No need to get fancy.
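
A sketch of that idea, using a simple line-membership check rather than a true diff. It is not a description of how Gmail actually works, and the function names are hypothetical.

```python
def strip_quote_prefix(line: str) -> str:
    """Remove leading '>' markers and surrounding whitespace."""
    return line.lstrip("> \t").strip()


def mark_quoted_lines(new_body: str, earlier_bodies: list) -> list:
    """Pair each line of new_body with True if it appeared earlier in the thread."""
    seen = set()
    for body in earlier_bodies:
        for line in body.splitlines():
            key = strip_quote_prefix(line)
            if key:  # blank lines carry no information
                seen.add(key)
    return [(line, strip_quote_prefix(line) in seen)
            for line in new_body.splitlines()]


earlier = ["Lunch on Friday?\n-- \nDarren"]
new = "Works for me.\n\n> Lunch on Friday?\n-- \nDarren"
marked = mark_quoted_lines(new, earlier)
```

Note that the unchanged signature lines are flagged as quoted too, which matches the behavior described above.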

Zac Thompson
I updated the description to describe the problem better. For the case of mail clients that insert a '> ' prefix, it's pretty simple, but for other mail clients it's problematic.
Darren
+4  A: 

It'll be pretty hard to duplicate the way Gmail does it since, as Zac says, it doesn't care whether a piece was quoted or not; it just seems to care about the diff.

It's actually pretty hard to get this right 100% of the time. Plain text email is "lossy"; it's entirely possible for you to send

> Here is my long line that is over 74 chars (email line length limit)

Which can get encoded as something like

> Here is my long line that is over 74 chars (email=
 line length limit)

And then is decoded as

> Here is my long line that is over 74 chars (email
line length limit)

Making it indistinguishable from an inline-reply.
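
The effect can be reproduced with a small simulation. This does not model any particular client or encoding; it just hard-wraps the quoted line without re-adding the prefix, with the width picked only so the break lands where it does in the example above.

```python
import textwrap

original = "> Here is my long line that is over 74 chars (email line length limit)"

# Simulated hard wrap that does not re-add the quote prefix.
# width=51 is arbitrary, chosen to break after "(email".
wrapped = textwrap.wrap(original, width=51)


def is_quoted(line: str) -> bool:
    """Naive check: a line is quoted if it starts with '>'."""
    return line.lstrip().startswith(">")


flags = [(line, is_quoted(line)) for line in wrapped]
for line, quoted in flags:
    print("quoted  " if quoted else "unquoted", line)
```

The continuation line carries no "> " marker, so a prefix-based detector treats it as new text written by the replier.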

This is email, so variations abound. Email usually line-wraps at something like 74 characters, and encoding schemes can differ. It's a real PITA. If you can access the HTML version, you will probably have better luck looking for quote tags and the like. Another idea would be to parse both the plain text and HTML versions to try to determine the boundaries.
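
For the HTML route, a minimal sketch using Python's standard `html.parser`. It assumes the client marks quoted material with `<blockquote>`; other clients may use different tags or wrapper elements, so the tag check is the part to adapt. The class and function names are made up.

```python
from html.parser import HTMLParser


class QuoteSplitter(HTMLParser):
    """Collect text inside and outside <blockquote> elements separately."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # current <blockquote> nesting level
        self.reply_text = []
        self.quoted_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "blockquote":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "blockquote" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        target = self.quoted_text if self.depth else self.reply_text
        target.append(text)


def split_html_quotes(html_body: str):
    """Return (reply_text, quoted_text) for an HTML message body."""
    parser = QuoteSplitter()
    parser.feed(html_body)
    parser.close()
    return " ".join(parser.reply_text), " ".join(parser.quoted_text)


sample = '<p>Agreed.</p><blockquote type="cite"><p>Shall we ship?</p></blockquote>'
reply, quoted = split_html_quotes(sample)
```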

Additionally, it's best to just plan for client-specific hacks. They all construct MIME messages differently, both in structure and header content.

Edit: I say this with the experience of writing an email processing system as well as seeing several people try to do the -exact- thing you're doing. It always only got "ok" results.

Richard Levasseur
And let's not forget that even the system employed by Gmail is far from perfect. I remember that a friend of mine once replied to an email, changing a few scattered words in the text I had sent, and Gmail failed to notice some of them. I was shocked. But maybe it has been fixed by now.
UncleZeiv
A: 

Hi

I am having a similar issue. I need to extract the original mail from all replied mails. Can you please post the regular expressions that match only the quoted text of the original mail (for different mail clients)?

subbi
Please ask this as a new question (click "Add Question") rather than posting it as a response to someone else's question.
Tim McNamara