Hi,

I am creating a web-based email client using C# and ASP.NET.

What is confusing is that various email clients seem to add the original text in a lot of different ways when replying to an email.

What I was wondering is whether there is some standardized way to disambiguate this process.

Thank you -Theo

+1  A: 

Not really, no.

The original RFC for the Internet Message Format talks about the In-Reply-To header, but doesn't specify the format of the body.

As you've found, different clients add the original text in different ways, which itself suggests there is no standard. On top of that, users do things differently as well:

  • Plain text, "rich text", and HTML each have a different way of separating the reply from the original
  • In Outlook I can choose from the following options when replying to a message:
    • Do not include
    • Attach original message
    • Include original message text
    • Include and indent original message text
    • Prefix each line of the original message
  • On top of that, I often send and receive replies that state "Responses in-line" where my comments are intermingled with the original message, so the original message no longer exists in its original form anyway.
Zhaph - Ben Duguid
Hi, I know there is no official way of doing this, but I am sure that with enough regex, coupled with email header parsing, a solution can be found. "Don't find fault, find a remedy." "I am looking for a lot of men who have an infinite capacity to not know what can't be done." - Henry Ford x2
Theofanis Pantelides
+1  A: 

There isn't a standardized way, but a sensible heuristic will get you a good distance.

Some algorithms classify lines based on their initial character(s) and by comparing the text to a corpus of marked up text, yielding a statistical probability for each line that it is a) part of the same block as the next/previous one and b) quoted text, a signature, new text, etc.
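
To make the line classification idea concrete, here is a minimal rule-based sketch in C#. It is a stand-in for the statistical approach described above, not an implementation of it: there is no corpus and no probabilities, only fixed rules. The names `LineKind` and `ReplyLineClassifier` are hypothetical, and the patterns (a leading ">", a line ending in "wrote:", the conventional "-- " signature delimiter) are assumptions that will miss plenty of real mail.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public enum LineKind { NewText, Quoted, Attribution, Signature }

public static class ReplyLineClassifier
{
    // Illustrative patterns only; real clients vary far more than this.
    private static readonly Regex QuotePrefix = new Regex(@"^\s*>");
    private static readonly Regex AttributionLine =
        new Regex(@"\bwrote:\s*$", RegexOptions.IgnoreCase);

    // Returns one label per line of the body, in order.
    public static List<LineKind> Classify(string body)
    {
        var kinds = new List<LineKind>();
        bool inSignature = false;

        foreach (string line in body.Replace("\r\n", "\n").Split('\n'))
        {
            if (QuotePrefix.IsMatch(line))
            {
                inSignature = false;           // quoted text ends a signature block
                kinds.Add(LineKind.Quoted);
            }
            else if (AttributionLine.IsMatch(line))
            {
                inSignature = false;           // "... wrote:" introduces a quote
                kinds.Add(LineKind.Attribution);
            }
            else if (inSignature || line.TrimEnd() == "--")
            {
                inSignature = true;            // "-- " starts the signature
                kinds.Add(LineKind.Signature);
            }
            else
            {
                kinds.Add(LineKind.NewText);
            }
        }
        return kinds;
    }
}
```

Calling `ReplyLineClassifier.Classify(body)` gives a label per line, which you could later replace with scores learned from sample messages.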

It'd be worth trying out some of the most popular e-mail clients, creating some sample messages in each, and comparing them to see what the differences are. Usenet newsgroups may also help you build a reasonable corpus of messages to work from. HTML e-mail adds an extra level of complexity of course, though most compliant mail clients will include the corresponding plain text as well. Different languages also cause issues, as clients which can parse "Paul wrote:" may fall over at "Pablo ha scritto:".

El Zorko
Not necessarily true, because "Paul wrote:" usually comes with a date and an <[email protected]>, which is language independent.
Theofanis Pantelides
I was referring to the last comment. The first link is rather helpful, and someone needs to get the bounty.
Theofanis Pantelides
+1  A: 

Some heuristics you can try are:

  • Any number of > characters
  • Looking for "wrote: " (be very careful with this one)
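
As a rough illustration of the ">" heuristic, the sketch below counts the leading ">" characters on a line, which approximates how many times that line has been quoted. The class and method names are made up for this example, and it assumes plain-text mail where quoting uses ">" optionally separated by spaces or tabs.

```csharp
using System;

public static class QuoteDepth
{
    // Counts '>' characters at the start of a line, skipping spaces and tabs
    // between them. Stops at the first other character, so a '>' that appears
    // later in the line is not counted.
    public static int Of(string line)
    {
        int depth = 0;
        foreach (char c in line)
        {
            if (c == '>') depth++;
            else if (c != ' ' && c != '\t') break;
        }
        return depth;
    }
}
```

A depth of 0 suggests new text, 1 the message being replied to, and higher values older messages in the thread.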

You can also try relating the Message-ID field of one message to the In-Reply-To field of another.
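
Here is one way the header matching could look, as a sketch rather than a finished parser. `HeaderThreading`, `ParseHeaders` and `IsReplyTo` are hypothetical names. The parsing is deliberately simplified: it assumes the raw header block is available as a string, joins folded continuation lines with a space, and lets a repeated header name overwrite the earlier value.

```csharp
using System;
using System.Collections.Generic;

public static class HeaderThreading
{
    // Parses "Name: value" lines up to the first blank line into a
    // case-insensitive dictionary.
    public static Dictionary<string, string> ParseHeaders(string rawMessage)
    {
        var headers = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        string last = null;

        foreach (string line in rawMessage.Replace("\r\n", "\n").Split('\n'))
        {
            if (line.Length == 0) break;                 // blank line ends the headers

            if ((line[0] == ' ' || line[0] == '\t') && last != null)
            {
                headers[last] += " " + line.Trim();      // folded continuation line
                continue;
            }

            int colon = line.IndexOf(':');
            if (colon <= 0) continue;                    // not a header line, skip it
            last = line.Substring(0, colon).Trim();
            headers[last] = line.Substring(colon + 1).Trim();
        }
        return headers;
    }

    // True when the reply's In-Reply-To value mentions the original's Message-ID.
    public static bool IsReplyTo(Dictionary<string, string> reply,
                                 Dictionary<string, string> original)
    {
        string inReplyTo, messageId;
        return reply.TryGetValue("In-Reply-To", out inReplyTo)
            && original.TryGetValue("Message-ID", out messageId)
            && messageId.Length > 0
            && inReplyTo.Contains(messageId);
    }
}
```

Once you know which stored message a reply refers to, you can compare the reply body against the original text instead of guessing purely from the quoting style.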

And finally, if you cannot find a good library to do this, it is time to start this project. No more parsing emails the Cthulhu way :)

Midhat
I agree that someone should have already done this.
Theofanis Pantelides
A: 

I was thinking:

// Requires: using System; using System.Text.RegularExpressions;
public String cleanMsgBody(String oBody, out Boolean isReply)
{
    // A line starting with five dashes, e.g. "-----Original Message-----"
    Regex rx1 = new Regex("\n-----");
    // A line containing ':' followed by whitespace and a '>' (attribution, then quoted text)
    Regex rx2 = new Regex("\n([^\n]+):([ \t\r\n\v\f]+)>");
    // A numeric date such as 1/2/2010 followed on the same line by <something>
    Regex rx3 = new Regex("([0-9]+)/([0-9]+)/([0-9]+)([^\n]+)<([^\n]+)>");

    // Normalize line endings so the "\n"-based patterns also work on CRLF mail
    String txtBody = oBody.Replace("\r\n", "\n");

    // Collapse blank lines, strip spaces after a newline, collapse runs of spaces
    txtBody = Regex.Replace(txtBody, "\n{2,}", "\n");
    txtBody = Regex.Replace(txtBody, "\n +", "\n");
    txtBody = Regex.Replace(txtBody, " {2,}", " ");

    // Cut the body at the first occurrence of each marker, in turn
    isReply = false;
    foreach (Regex rx in new[] { rx1, rx2, rx3 })
    {
        Match m = rx.Match(txtBody);
        if (m.Success)
        {
            isReply = true;
            txtBody = txtBody.Substring(0, m.Index); // keep only the text before the marker
        }
    }

    return txtBody;
}
Theofanis Pantelides
Of course, as a place to start, not as a complete solution.
Theofanis Pantelides