ansaurus

Question

Parse email content with Regular Expressions

Answer 1

+2 A:

Read Mastering Regular Expressions. It will teach you everything you need to know to complete this and other similar regex problems, and will give you enough understanding and insight to get you started writing much more complicated regular expressions.

Kibbee 2008-12-13 17:52:52

Thanks for your quick comment! I will take a look at this book shortly, but in the meantime I need to get this done as soon as possible. Can you please give me some advice about how to implement it? Cheers, --Pablo

2008-12-13 19:35:04

Answer 2

+1 A:

If the emails always have the same format, you can do this in a number of different ways. A simple way would be to split the text on newlines and take a substring of each line, starting after the label.
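A minimal sketch of the split-and-substring idea, assuming every field sits on its own line and the line starts with its label. The class and method names (`SubstringExtractor`, `TryGetValue`) are made up for illustration, and the labels are borrowed from the sample email discussed elsewhere on this page.

```csharp
using System;

public static class SubstringExtractor
{
    // Looks for the first line that starts with `label` and returns the text
    // after the label. Returns false (and an empty value) when no line matches.
    public static bool TryGetValue(string body, string label, out string value)
    {
        // Accept both Windows (\r\n) and Unix (\n) line endings.
        string[] lines = body.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None);
        foreach (string line in lines)
        {
            if (line.StartsWith(label, StringComparison.Ordinal))
            {
                value = line.Substring(label.Length).Trim();
                return true;
            }
        }
        value = string.Empty;
        return false;
    }
}
```

Calling `SubstringExtractor.TryGetValue(body, "Email:", out string email)` would then fill `email` with whatever follows the label on that line.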

With regexes, you'd probably create a regex that creates a number of named captures. You can then index into the Groups property of the match on the name of each named group in order to get the value out of it. This is a little more complex, of course.
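As a hedged illustration of named captures, the sketch below pulls two fields out of the body. The pattern, the group names (`email`, `tel`) and the helper class are assumptions chosen for this example; a real pattern would need one group per field in the email.

```csharp
using System.Text.RegularExpressions;

public static class NamedCaptureExample
{
    // [^\r\n]* captures the rest of the line without the line break, and
    // [\r\n]+ skips one or more line breaks (including blank lines) between fields.
    private static readonly Regex ContactPattern = new Regex(
        @"Email: (?<email>[^\r\n]*)[\r\n]+Tel\.: (?<tel>[^\r\n]*)");

    public static bool TryParse(string body, out string email, out string tel)
    {
        Match match = ContactPattern.Match(body);
        // Index into Groups by name instead of by position.
        email = match.Success ? match.Groups["email"].Value.Trim() : string.Empty;
        tel = match.Success ? match.Groups["tel"].Value.Trim() : string.Empty;
        return match.Success;
    }
}
```

Compared with numbered groups, the names keep working if a new capture is later inserted earlier in the pattern.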

Will 2008-12-13 19:40:38

The substring/IndexOf() way would also be faster than building a complex regex.

Tomalak 2008-12-14 10:34:35

Answer 3

A:

We found that for spam filtering and other high-volume applications, regular expressions are a bit slow for parsing MIME headers, which is what you want to do. The code is somewhat specialized, but I wrote a C state machine for doing the parsing which is as fast as you'll get without going to something like re2c. The code is not for the faint of heart, but it is blindingly fast.

For emails I think you'll find an explicit state machine is easier to work with than regular expressions. It's also the last refuge of the goto statement!
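To give a rough idea of what an explicit state machine looks like, here is a small C# sketch. It is not the C code mentioned above and it does not implement MIME header rules (for instance, it ignores folded continuation lines); it only assumes simple "Label: value" lines. All names are hypothetical.

```csharp
using System.Collections.Generic;
using System.Text;

public static class LabelStateMachine
{
    // Label: reading characters before the first colon on the line.
    // Value: reading characters after that colon.
    private enum State { Label, Value }

    public static List<KeyValuePair<string, string>> Parse(string text)
    {
        var result = new List<KeyValuePair<string, string>>();
        var label = new StringBuilder();
        var value = new StringBuilder();
        State state = State.Label;

        foreach (char c in text + "\n")   // the extra newline flushes the last line
        {
            if (c == '\r') continue;      // ignore carriage returns
            if (c == '\n')
            {
                // End of line: emit a pair only if a colon was seen on this line.
                if (state == State.Value)
                {
                    result.Add(new KeyValuePair<string, string>(
                        label.ToString().Trim(), value.ToString().Trim()));
                }
                label.Clear();
                value.Clear();
                state = State.Label;
            }
            else if (state == State.Label)
            {
                if (c == ':') state = State.Value;   // the first colon ends the label
                else label.Append(c);
            }
            else
            {
                value.Append(c);          // later colons belong to the value
            }
        }
        return result;
    }
}
```

Each character is examined exactly once and there is no backtracking, which is where this style of parser gets its speed.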

Norman Ramsey 2008-12-13 20:06:04

Answer 4

+2 A:

Assuming that the parts in your email that are not bold always occur like that in all your emails, you can easily grab all the parts from your email with the regex:

Sig\./Sig\.ra :(.*)

Email: (.*)

Tel\.: (.*)

sta cercando un immobile con le seguenti caratteristiche:

Categoria: (.*)

Tipologia: (.*)

Tipo di contratto: (.*)

Comune: (.*)

Zona: (.*)

Fascia di prezzo: (.*)

In C# (with using System.Text.RegularExpressions; at the top of the file):

Regex regexObj = new Regex(@"Sig\./Sig\.ra :(.*)

Email: (.*)

Tel\.: (.*)

sta cercando un immobile con le seguenti caratteristiche:

Categoria: (.*)

Tipologia: (.*)

Tipo di contratto: (.*)

Comune: (.*)

Zona: (.*)

Fascia di prezzo: (.*)");
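Match matchObj = regexObj.Match(subjectString);
if (matchObj.Success) {
    // Groups are numbered from 1, in the order the parentheses appear in the pattern
    string Sig = matchObj.Groups[1].Value;
    string Email = matchObj.Groups[2].Value;
    // and so on to get all the other parts
}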

Jan Goyvaerts 2008-12-14 13:26:08

Answer 5

A:

You really don't want to do this manually, or with regular expressions. There are many different ways to encode data in an email, and many emails that don't strictly conform to the spec that can still be parsed. I have had success with AnPOP in a .NET environment.

Chase Seibert 2008-12-15 02:22:45

Answer 6

+1 A:

I think it would be much better to split the string into an array of lines. You can initialize a dictionary with all the titles as keys, then search each line for a title from the dictionary ("Email:", for example) and put the result back into the dictionary as the value. At the end you will have a dictionary with all the titles and their values. I don't think you need a regex for that. This way, the order of the titles won't matter either.
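The dictionary approach described above could look roughly like this. It is a sketch under the assumption that each title appears at most once per line; the class name `TitleDictionaryParser` and the way titles are passed in are invented for the example.

```csharp
using System;
using System.Collections.Generic;

public static class TitleDictionaryParser
{
    // Returns a dictionary keyed by title. Titles that never appear in the
    // body keep an empty string as their value.
    public static Dictionary<string, string> Parse(string body, IEnumerable<string> titles)
    {
        var values = new Dictionary<string, string>();
        foreach (string title in titles)
        {
            values[title] = string.Empty;
        }

        string[] lines = body.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string line in lines)
        {
            foreach (string title in titles)
            {
                int at = line.IndexOf(title, StringComparison.Ordinal);
                if (at >= 0)
                {
                    // Everything after the title on this line is the value.
                    values[title] = line.Substring(at + title.Length).Trim();
                    break;
                }
            }
        }
        return values;
    }
}
```

For example, `TitleDictionaryParser.Parse(body, new[] { "Email:", "Zona:", "Comune:" })` returns one entry per title regardless of the order in which the lines appear in the email.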

Karim 2009-11-01 19:10:50
