ansaurus

Question

Answer 1

+1 A:

Not really, no.

The original RFC for the Internet Message Format covers the In-Reply-To header, but it doesn't specify how the body of a reply should be formatted.

As you've found, different clients add the original text in different ways, which implies there is no standard. Users also do things differently:

- Plain text, "rich text", and HTML each have a different way of separating the reply from the original.
- In Outlook I can choose from the following options when replying to a message:
  - Do not include
  - Attach original message
  - Include original message text
  - Include and indent original message text
  - Prefix each line of the original message
On top of that, I often send and receive replies that state "Responses in-line" where my comments are intermingled with the original message, so the original message no longer exists in its original form anyway.

Zhaph - Ben Duguid 2010-03-15 13:06:15

Hi, I know there is no official way of doing this, but I am sure that with enough regex, coupled with email header parsing, a solution can be found. "Don't find fault, find a remedy." "I am looking for a lot of men who have an infinite capacity to not know what can't be done." - Henry Ford x2

Theofanis Pantelides 2010-03-15 17:32:22

Answer 2

+1 A:

There isn't a standardized way, but a sensible heuristic will get you a good distance.

Some algorithms classify lines based on their initial character(s) and by comparing the text to a corpus of marked up text, yielding a statistical probability for each line that it is a) part of the same block as the next/previous one and b) quoted text, a signature, new text, etc.
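As a rough illustration of per-line classification, here is a minimal rule-based sketch. It is not the statistical approach described above: it labels lines from their leading characters only, with no corpus behind it. The names `LineKind` and `LineClassifier` are made up for this example, and treating a bare "--" as a signature delimiter (in addition to the conventional "-- ") is an assumption.

```csharp
using System;
using System.Collections.Generic;

public enum LineKind { NewText, Quoted, Signature, Blank }

public static class LineClassifier
{
    // Labels each line using only its leading characters and what came before it.
    public static List<LineKind> Classify(string body)
    {
        var kinds = new List<LineKind>();
        bool inSignature = false;

        foreach (string line in body.Replace("\r\n", "\n").Split('\n'))
        {
            // "-- " is the conventional signature delimiter; a bare "--" is
            // accepted too, on the assumption that trailing spaces get stripped.
            if (line == "-- " || line == "--")
                inSignature = true;

            if (inSignature)
                kinds.Add(LineKind.Signature);   // everything after the delimiter
            else if (line.Trim().Length == 0)
                kinds.Add(LineKind.Blank);
            else if (line.TrimStart().StartsWith(">", StringComparison.Ordinal))
                kinds.Add(LineKind.Quoted);
            else
                kinds.Add(LineKind.NewText);
        }
        return kinds;
    }
}
```

A statistical classifier would replace the fixed rules with per-line probabilities, but the input and output shape (one label per line) could stay the same.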

It'd be worth trying out some of the most popular e-mail clients and creating and comparing some sample messages to see what the differences are. Usenet newsgroups may also help you build a reasonable corpus of messages to work from. HTML e-mail adds an extra level of complexity of course, though most compliant mail clients will include the corresponding plain text as well. Different languages also cause issues, as clients which can parse "Paul wrote:" may fall over at "Pablo ha scritto:".

El Zorko 2010-03-19 21:14:45

Not necessarily true, because "Paul wrote:" usually comes with a date and an email address in angle brackets, both of which are language-independent.

Theofanis Pantelides 2010-03-22 11:38:45

I was referring to the last comment; the first link is rather helpful, and someone needs to get the bounty.

Theofanis Pantelides 2010-03-22 12:54:32

Answer 3

+1 A:

Some heuristics you can try are:

- Any number of > characters
- Looking for "wrote: " (be very careful with this one)
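The two heuristics above might be sketched as follows. This is only one possible reading: `QuoteHeuristics`, `QuoteDepth`, and `IsAttribution` are hypothetical names, and the "be careful" part is interpreted here as an assumption that a "wrote:" line only counts when the next non-blank line is actually quoted.

```csharp
using System;

public static class QuoteHeuristics
{
    // Counts leading '>' markers, allowing spaces between them ("> > text" has depth 2).
    public static int QuoteDepth(string line)
    {
        int depth = 0;
        foreach (char c in line)
        {
            if (c == '>') depth++;
            else if (c != ' ') break;   // first real character ends the prefix
        }
        return depth;
    }

    // Treats lines[i] as an attribution only if it ends with "wrote:" and the
    // next non-blank line is quoted, so ordinary sentences are not flagged.
    public static bool IsAttribution(string[] lines, int i)
    {
        if (!lines[i].TrimEnd().EndsWith("wrote:", StringComparison.OrdinalIgnoreCase))
            return false;

        for (int j = i + 1; j < lines.Length; j++)
        {
            if (lines[j].Trim().Length == 0) continue;   // skip blank lines
            return QuoteDepth(lines[j]) > 0;
        }
        return false;   // nothing follows the candidate line
    }
}
```

Requiring a quoted line after the attribution still only handles English "wrote:", so it inherits the language problem raised in the other answers.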

You can also try matching the Message-ID field against the In-Reply-To field.
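A minimal sketch of that header comparison, assuming the raw values of the Message-ID and In-Reply-To headers have already been pulled out of each message as strings (the parsing itself is not shown). `ThreadHeaders`, `ExtractIds`, and `IsReplyTo` are illustrative names, not part of any library.

```csharp
using System;
using System.Collections.Generic;

public static class ThreadHeaders
{
    // Pulls every "<...>" message identifier out of a header value.
    public static List<string> ExtractIds(string headerValue)
    {
        var ids = new List<string>();
        if (headerValue == null) return ids;

        int pos = 0;
        int start;
        while ((start = headerValue.IndexOf('<', pos)) >= 0)
        {
            int end = headerValue.IndexOf('>', start + 1);
            if (end < 0) break;   // unterminated identifier, stop scanning
            ids.Add(headerValue.Substring(start, end - start + 1));
            pos = end + 1;
        }
        return ids;
    }

    // True when the candidate reply's In-Reply-To header names the parent's Message-ID.
    public static bool IsReplyTo(string parentMessageId, string replyInReplyTo)
    {
        List<string> parentIds = ExtractIds(parentMessageId);
        if (parentIds.Count == 0) return false;
        return ExtractIds(replyInReplyTo).Contains(parentIds[0]);
    }
}
```

This tells you which message a reply belongs to, which in turn lets you compare the reply body against the parent's text instead of guessing from the body alone.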

And finally, if you cannot find a good library to do this, it is time to start this project. No more parsing emails the Cthulhu way :)

Midhat 2010-03-21 20:29:38

I agree that someone should have already done this.

Theofanis Pantelides 2010-03-22 10:09:32

Answer 4

A:

I was thinking:

// Requires: using System; using System.Text.RegularExpressions;
// Returns the text above the first detected quote marker;
// isReply reports whether any marker was found.
public String cleanMsgBody(String oBody, out Boolean isReply)
{
    // Separator lines such as "-----Original Message-----"
    Regex rx1 = new Regex("\n-----");
    // A line ending in ':' followed by a '>'-quoted line
    Regex rx2 = new Regex("\n([^\n]+):([ \t\r\n\v\f]+)>");
    // A numeric date followed by an address in angle brackets
    Regex rx3 = new Regex("([0-9]+)/([0-9]+)/([0-9]+)([^\n]+)<([^\n]+)>");

    // Normalize line endings, then collapse blank lines, leading spaces and repeated spaces
    String txtBody = oBody.Replace("\r\n", "\n");
    txtBody = Regex.Replace(txtBody, "\n[ \n]+", "\n");
    txtBody = Regex.Replace(txtBody, " {2,}", " ");

    isReply = false;
    foreach (Regex rx in new[] { rx1, rx2, rx3 })
    {
        if (rx.IsMatch(txtBody))
        {
            isReply = true;
            txtBody = rx.Split(txtBody)[0]; // keep only the text before the first match
        }
    }

    return txtBody;
}

Theofanis Pantelides 2010-03-22 10:07:41

Of course, as a place to start, not as a complete solution.

Theofanis Pantelides 2010-03-22 11:31:22


parsing email text reply/forward
