Before you guys go telling me that Regex is the epitome of all evil... I already know. If I had more hair it would be ripped out already.

So onto the question. I have made a parser using regex that strips out the desired parts of an HTML email. Why on earth would I want to do that? Because I'm still a beginner programmer, OK? If you can suggest a better way then by all means... do. The parser works perfectly on normal HTML parts of an email; however, if someone sends me an email with just one attachment (or more)...

ALL HELL BREAKS LOOSE!

Instead of getting what a normal html email looks like, I get the plain text version with the html version concatenated onto the end like so:

--_1b4078c9-04f5-4cca-a220-e5b30eddef46_
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable


To: ****@****=3B ****@**** | Emmanuel Smith=3B=
 Jonny Barnes
cc: |bcc: |Ref: Test123
---

Lorem ipsum dolor sit amet=2C consectetur adipiscing elit. Praesent in augu=
e nec justo tempor varius eu et tellus. Nunc id massa tortor=2C ut lobortis=
 sem. Class aptent taciti sociosqu ad litora torquent per conubia nostra=2C=
 per inceptos himenaeos. Maecenas quis nisl nec quam tristique posuere sed =
at nibh. Cras fringilla vestibulum metus vel porttitor. 2 + 2 =3D 7 Cras ia=
culis=2C erat nec gravida accumsan=2C metus felis vestibulum risus=2C quis =
venenatis nisl nulla sed diam. Aenean quis viverra velit. Etiam quis massa =
lectus=2C faucibus facilisis sem. Curabitur non eros tellus. Sed at ligula =
neque. Donec elementum rhoncus volutpat. Curabitur eu accumsan erat. Phasel=
lus auctor odio dolor=2C ut ornare augue. Suspendisse vel est nibh. Vivamus=
 facilisis placerat augue sit amet aliquam. Maecenas viverra=2C ipsum a tin=
cidunt elementum=2C arcu tellus rutrum ipsum=2C et dignissim urna orci ac m=
i. Vivamus non odio massa. Nulla congue massa eu leo pretium non consequat =
urna molestie.



Integer neque odio=2C scelerisque at molestie quis=2C congue sed arcu. Prae=
sent a arcu odio. Donec sollicitudin=2C quam vel tincidunt lobortis=2C urna=
 augue cursus lorem=2C in eleifend nunc risus nec neque. Donec euismod maur=
is non nibh blandit sollicitudin. Vivamus sed tincidunt augue. Suspendisse =
iaculis massa ut tellus rutrum auctor. Cras venenatis consequat urna in viv=
erra. Ut blandit imperdiet dolor non scelerisque. Suspendisse potenti. Sed =
vitae lacus ac odio euismod tempus. Aenean ut sem odio. Curabitur auctor pu=
rus a diam iaculis facilisis. Integer molestie commodo mauris a imperdiet. =
Nunc aliquet tempus orci sit amet viverra.

                     =20
Hotmail is redefining busy with tools for the New Busy. Get more from your =
inbox. See how.                      =20
_________________________________________________________________
The New Busy is not the old busy. Search=2C chat and e-mail from your inbox=
..
http://www.windowslive.com/campaign/thenewbusy?ocid=3DPID28326::T:WLMTAGL:O=
N:WL:en-US:WM_HMP:042010_3=

--_1b4078c9-04f5-4cca-a220-e5b30eddef46_
Content-Type: text/html; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

<html>
<head>
<style><!--
..hmmessage P
{
margin:0px=3B
padding:0px
}
body.hmmessage
{
font-size: 10pt=3B
font-family:Verdana
}
--></style>
</head>
<body class=3D'hmmessage'>
To: ****@**** ****@**** | Emmanuel Smith=3B=
 Jonny Barnes<br><div>cc: |</div><div>bcc: |</div><div>Ref: Test123</div><d=
iv><br><span class=3D"ecxecxecxecxecxecxecxecxecxecxecxecxecxecxecxecxecxec=
xecxApple-style-span" style=3D"font-family:Tahoma=2C Verdana=2C Arial=2C sa=
ns-serif=3Bcolor:rgb(68=2C 68=2C 68)"><font class=3D"ecxecxecxecxecxecxecxe=
cxecxecxecxecxecxecxecxApple-style-span" color=3D"#000000"><font class=3D"e=
cxecxecxecxecxecxApple-style-span" face=3D"Verdana">---<br></font></font><d=
iv><font class=3D"ecxecxecxecxecxecxApple-style-span" face=3D"Verdana"><br>=
</font></div><div><span class=3D"ecxecxecxecxecxecxecxecxecxecxecxecxecxecx=
ecxecxecxecxecxecxecxecxecxecxecxecxecxecxApple-style-span" style=3D"font-s=
ize:11px=3Bline-height:14px"><font class=3D"ecxecxecxecxecxecxApple-style-s=
pan" face=3D"Verdana">Lorem ipsum dolor sit amet=2C consectetur adipiscing =
elit. Praesent in augue nec justo tempor varius eu et tellus. Nunc id massa=
 tortor=2C ut lobortis sem. Class aptent taciti sociosqu ad litora torquent=
 per conubia nostra=2C per inceptos himenaeos. Maecenas quis nisl nec quam =
tristique posuere sed at nibh. Cras fringilla vestibulum metus vel porttito=
r. 2 + 2 =3D 7 Cras iaculis=2C erat nec gravida accumsan=2C metus felis ves=
tibulum risus=2C quis venenatis nisl nulla sed diam. Aenean quis viverra ve=
lit. Etiam quis massa lectus=2C faucibus facilisis sem. Curabitur non eros =
tellus. Sed at ligula neque. Donec elementum rhoncus volutpat. Curabitur eu=
 accumsan erat. Phasellus auctor odio dolor=2C ut ornare augue. Suspendisse=
 vel est nibh. Vivamus facilisis placerat augue sit amet aliquam. Maecenas =
viverra=2C ipsum a tincidunt elementum=2C arcu tellus rutrum ipsum=2C et di=
gnissim urna orci ac mi. Vivamus non odio massa. Nulla congue massa eu leo =
pretium non consequat urna molestie.</font></span></div><div><span class=3D=
"ecxecxecxecxecxecxecxecxecxecxecxecxecxecxecxecxecxecxecxecxecxecxecxecxec=
xecxecxecxApple-style-span" style=3D"font-size:11px=3Bline-height:14px"><fo=
nt class=3D"ecxecxecxecxecxecxApple-style-span" face=3D"Verdana"><br></font=
></span></div><div><span class=3D"ecxecxecxecxecxecxecxecxecxecxecxecxecxec=
xecxecxecxecxecxecxecxecxecxecxecxecxecxecxApple-style-span" style=3D"font-=
size:11px=3Bline-height:14px"><font class=3D"ecxecxecxecxecxecxApple-style-=
span" face=3D"Verdana"><br></font></span></div><div><span class=3D"ecxecxec=
xecxecxecxecxecxecxecxecxecxecxecxecxecxecxecxecxecxecxecxecxecxecxecxecxec=
xApple-style-span" style=3D"font-size:11px=3Bline-height:14px"><font class=
=3D"ecxecxecxecxecxecxApple-style-span" face=3D"Verdana"><br></font></span>=
</div><div><font class=3D"Apple-style-span" face=3D"Verdana" size=3D"3"><sp=
an class=3D"Apple-style-span" style=3D"font-size: 11px=3B line-height: 14px=
=3B"><br></span></font></div><span class=3D"ecxecxecxecxecxecxecxecxecxecxe=
cxecxecxecxecxecxecxecxecxecxecxecxecxecxecxecxecxecxecxecxecxecxecxecxecxe=
cxecxecxecxecxApple-style-span" style=3D"font-family:Arial=2C Helvetica=2C =
sans=3Bfont-size:11px"><p style=3D"margin-right:0px=3Bmargin-bottom:14px=3B=
margin-left:0px=3Btext-align:justify=3Bfont-size:11px=3Bline-height:14px=3B=
padding-top:0px=3Bpadding-right:0px=3Bpadding-bottom:0px=3Bpadding-left:0px=
"><font class=3D"ecxecxecxecxecxecxApple-style-span" face=3D"Verdana">Integ=
er neque odio=2C scelerisque at molestie quis=2C congue sed arcu. Praesent =
a arcu odio. Donec sollicitudin=2C quam vel tincidunt lobortis=2C urna augu=
e cursus lorem=2C in eleifend nunc risus nec neque. Donec euismod mauris no=
n nibh blandit sollicitudin. Vivamus sed tincidunt augue. Suspendisse iacul=
is massa ut tellus rutrum auctor. Cras venenatis consequat urna in viverra.=
 Ut blandit imperdiet dolor non scelerisque. Suspendisse potenti. Sed vitae=
 lacus ac odio euismod tempus. Aenean ut sem odio. Curabitur auctor purus a=
 diam iaculis facilisis. Integer molestie commodo mauris a imperdiet. Nunc =
aliquet tempus orci sit amet viverra.</font></p><p style=3D"margin-right:0p=
x=3Bmargin-bottom:14px=3Bmargin-left:0px=3Btext-align:justify=3Bfont-size:1=
1px=3Bline-height:14px=3Bpadding-top:0px=3Bpadding-right:0px=3Bpadding-bott=
om:0px=3Bpadding-left:0px"><font class=3D"ecxecxecxecxecxecxApple-style-spa=
n" face=3D"Verdana"><br></font></p><p style=3D"margin-right:0px=3Bmargin-bo=
ttom:14px=3Bmargin-left:0px=3Btext-align:justify=3Bfont-size:11px=3Bline-he=
ight:14px=3Bpadding-top:0px=3Bpadding-right:0px=3Bpadding-bottom:0px=3Bpadd=
ing-left:0px"><font class=3D"Apple-style-span" face=3D"Verdana"><br></font>=
</p></span></span></div>                      <br><hr>Hotmail is redefining busy with=
 tools for the New Busy. Get more from your inbox. <a href=3D"http://www.wi=
ndowslive.com/campaign/thenewbusy?ocid=3DPID28326::T:WLMTAGL:ON:WL:en-US:WM=
_HMP:042010_2">See how.</a>                       <br /><hr />The New Busy is not the =
old busy. Search=2C chat and e-mail from your inbox. <a href=3D'http://www.=
windowslive.com/campaign/thenewbusy?ocid=3DPID28326::T:WLMTAGL:ON:WL:en-US:=
WM_HMP:042010_3' target=3D'_new'>Get started.</a></body>
</html>=

--_1b4078c9-04f5-4cca-a220-e5b30eddef46_--

So my question is... How can I separate the html version from the text version using regex (or by easier means)?

+3  A: 

This gets a little complicated, but you can google MIME email structures if you want to get into the nitty-gritty. I'm going to try to explain the email structure a little bit rather than attempt to answer with a specific RegEx (mainly because I'm not sure a RegEx will do what you want).

If you look at your raw emails you will see:

--_1b4078c9-04f5-4cca-a220-e5b30eddef46_

This is the MIME boundary; it separates the individual pieces of a MIME email message. A MIME email message can contain many parts, including an HTML version of the email, a plain text version, as well as file or image attachments. If you look at the two lines following the boundary, they explain what the upcoming part is by using its MIME type.

If you look at the top of the raw email, you will see the 'Content-Type' header, which in a multi-part MIME message should be followed by a 'boundary=' parameter. You can take that boundary (as shown above) and use that to break up the pieces of your email. Each delimiter line in the body is the boundary value prefixed with '--', and the last one also ends with '--'.

What I think makes it difficult to do via a RegEx is that the boundary will be different for every email, so the job is better suited to code than to a single pattern. You might want to use a RegEx to find the boundary, and then some logic to break the message up, maybe something like:

myMessage.Split(myBoundary)
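
To make that idea concrete, here is a minimal sketch in Python (chosen only for brevity; the thread does not show the asker's own code). It assumes the raw message is already in a string, that the top-level header declares a boundary= parameter, and that delimiter lines are the boundary prefixed with "--". The names find_boundary and split_parts are made up for this example.

```python
import re

def find_boundary(raw_message):
    """Return the boundary declared in the Content-Type header, or None."""
    match = re.search(r'boundary="?([^";\r\n]+)"?', raw_message, re.IGNORECASE)
    return match.group(1) if match else None

def split_parts(raw_message):
    """Split a multipart message into its raw parts (part headers + body)."""
    boundary = find_boundary(raw_message)
    if boundary is None:
        return [raw_message]  # not multipart, nothing to split
    chunks = raw_message.split("--" + boundary)
    parts = []
    # chunks[0] is the top-level header block, so skip it.
    for chunk in chunks[1:]:
        if chunk.startswith("--"):
            break  # closing delimiter ("--boundary--") reached
        parts.append(chunk.strip("\r\n"))
    return parts
```

Each returned part still carries its own Content-Type lines and its quoted-printable encoding, so picking the text/html part and decoding it remain separate steps.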
Coding Gorilla
A: 

It seems to me that the HTML part of your email begins shortly after Content-Type: text/html;, so I'd say those few lines would be a good token to indicate that the HTML is beginning. As for a regex, I think this would do: (.+)Content-Type: text/html; charset="iso-8859-1"(.+). The text part of your input will be in capture group 1, and the HTML part in capture group 2. You will need to set the option that makes . match \n as well as other characters (often called single-line or dot-all mode).
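
A small sketch of that pattern in Python, where the option in question is re.DOTALL. It is only an illustration: the pattern hard-codes the charset from the sample message, and both captures still contain boundary lines, part headers and quoted-printable encoding. The function name split_text_and_html is hypothetical.

```python
import re

# Same pattern as in the answer; DOTALL lets "." match newlines too.
PATTERN = re.compile(
    r'(.+)Content-Type: text/html; charset="iso-8859-1"(.+)',
    re.DOTALL,
)

def split_text_and_html(raw_body):
    """Return (text_chunk, html_chunk), or None if the token is absent."""
    match = PATTERN.search(raw_body)
    if match is None:
        return None
    return match.group(1), match.group(2)
```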

FrustratedWithFormsDesigner
A: 

New to this site, just poking around for a few minutes before going back to work.

This thing here: "--_1b4078c9-04f5-4cca-a220-e5b30eddef46_" is a string declared as a MIME part separator in the headers of the incoming email message. Look for it and save it.

Looks like each MIME part delimited by that string breaks down into two sections: a list of name-value pairs and then the "real content" of the section, separated by a blank line. I'm pretty sure there's some hairy MIME standard that says what I just said. :) In the first ("attributes") section, look for the content-type you want (text/html). Once you've found it, look for the next blank line, and once you've found THAT, suck in the content until the next MIME part delimiter. That'll be your html email message, which you can then process.

You will not be able to do all this with one magnificent regex, I believe. You'll have to code a loop over each line of the incoming message, and do a state machine sorta thing. States will be something like:

(1) Mime part delimiter unknown
(2) Mime part delimiter known, mime part unseen
(3) Mime part seen, type not html
(4) Mime part seen, type is html, blank line not seen
(5) Blank line seen, mime part delimiter unseen
(6) Mime part delimiter seen (end of processing)
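
One way those states could be sketched, in Python for illustration. The assumptions are that the message is one string, that the boundary= parameter and each part's Content-Type sit on a single unfolded line, and that the returned body is left in its quoted-printable form. The function name extract_html_part is hypothetical.

```python
def extract_html_part(raw_message):
    """Return the (still encoded) body of the first text/html part, or None."""
    boundary = None          # state 1 until the boundary= parameter is seen
    in_part_headers = False  # True between a delimiter and the next blank line
    is_html = False          # True once the current part declares text/html
    in_html_body = False     # state 5: collecting the HTML body
    html_lines = []

    for line in raw_message.splitlines():
        if boundary is None:
            # State 1: scan the top-level headers for boundary=...
            idx = line.lower().find("boundary=")
            if idx != -1:
                value = line[idx + len("boundary="):]
                boundary = value.split(";")[0].strip().strip('"')
            continue
        if line.startswith("--" + boundary):
            if in_html_body:
                break        # state 6: delimiter after the HTML body, done
            in_part_headers = True   # a new part begins (states 3/4)
            is_html = False
            continue
        if in_part_headers:
            lowered = line.lower()
            if lowered.startswith("content-type:") and "text/html" in lowered:
                is_html = True       # state 4: this part is HTML
            elif line == "":
                in_part_headers = False
                in_html_body = is_html   # state 5 only for the HTML part
            continue
        if in_html_body:
            html_lines.append(line)

    return "\n".join(html_lines) if html_lines else None
```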

I'm being very fast, loose, and sloppy. Hope this helps.

John.

JohnL4
+1  A: 

There are a few open source C# MIME parsers available:

The last two are a bit old. If they don't easily compile, their source might point you in the right direction.

Remember, an email can contain an attachment that is an email that contains an attachment, etc, etc... At some point, Regex will let you down.
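
To show what handing the job to a MIME parser looks like, here is a short sketch using Python's standard-library email package (not one of the C# libraries referred to above; Python is used only for illustration). It assumes the whole raw message, headers included, is available as a string. The function name get_html_body is hypothetical.

```python
import email

def get_html_body(raw_message):
    """Return the decoded text/html body of a raw message, or None."""
    msg = email.message_from_string(raw_message)
    # walk() visits the message and every nested part in turn.
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            # decode=True undoes quoted-printable/base64 and returns bytes.
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace")
    return None
```

With a parser doing the splitting and decoding, sequences such as =3D and =2C in the sample come back as = and , without any hand-written pattern.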

Corbin March
I think you may have rendered a whole day of my programming useless :/
Immanu'el Smith