Every day I receive thousands of emails, and I want to parse the body of each one and load the data into a database.

My problem is that I currently parse the email body by hand, and I would like to replace that logic with a regular expression in C#.

Here is the body of the emails:


Gentilissima Agenzia Nexity Residenziale

il nostro utente:

Sig./Sig.ra :Pablo Azorin

Email: [email protected]

Tel.: 02322-498900

sta cercando un immobile con le seguenti caratteristiche:

Categoria: Residenziale

Tipologia: Villa

Tipo di contratto: Vendita

Comune: Assago Prov. Milano

Zona: non specificata

Fascia di prezzo: non specificata


I need to extract the text in bold (the value after each label), and I thought a regex is what I need for this.

I look forward to your suggestions on how to make this work.

Thanks!

--Pablo

+2  A: 

Read Mastering Regular Expressions. It will teach you everything you need to know to complete this and other similar regex problems, and will give you enough understanding and insight to get you started writing much more complicated regular expressions.

Kibbee
Thanks for your quick comment! I will take a look at this book shortly, but in the meantime I need to get this done as soon as possible. Can you please give me some advice on how to implement it? Cheers, --Pablo
+1  A: 

If the emails always have the same format, you can do this in a number of different ways. A simple way is to split the body on newlines and take a substring of each line, starting after the label.
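A minimal sketch of the split-and-substring idea, assuming every field sits on its own line and starts with a fixed label. The class name `LineParser` and the method `ValueAfterLabel` are made up for illustration.

```csharp
using System;

static class LineParser
{
    // Returns the text after `label` on the first line that starts with it,
    // or null when no line carries that label.
    public static string ValueAfterLabel(string body, string label)
    {
        // Splitting on both '\r' and '\n' copes with Windows and Unix line endings.
        string[] lines = body.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string line in lines)
        {
            string trimmed = line.Trim();
            if (trimmed.StartsWith(label, StringComparison.Ordinal))
            {
                // Everything after the label is the value.
                return trimmed.Substring(label.Length).Trim();
            }
        }
        return null;
    }
}
```

For the sample email, `LineParser.ValueAfterLabel(body, "Tel.:")` would yield `02322-498900`. Pass the full label including the colon so that `Tipologia:` and `Tipo di contratto:` are not confused.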

With regexes, you'd probably write one pattern containing a number of named capture groups. You can then index into the Groups property of the match by group name to get each value out. This is a little more complex, of course.
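As a rough sketch of the named-capture approach, limited to the first three fields and assuming the labels appear in this fixed order. The group names (`name`, `email`, `phone`) and the helper `TryParseContact` are illustrative choices, not anything taken from the emails themselves.

```csharp
using System;
using System.Text.RegularExpressions;

static class NamedGroupParser
{
    // [^\r\n]* captures the rest of the line without a trailing '\r';
    // \s+ skips the line break(s) and any blank lines between fields.
    static readonly Regex ContactPattern = new Regex(
        @"Sig\./Sig\.ra :(?<name>[^\r\n]*)\s+" +
        @"Email: (?<email>[^\r\n]*)\s+" +
        @"Tel\.: (?<phone>[^\r\n]*)");

    public static bool TryParseContact(string body, out string name, out string email, out string phone)
    {
        Match m = ContactPattern.Match(body);
        // On a failed match every group value is the empty string.
        name = m.Groups["name"].Value.Trim();
        email = m.Groups["email"].Value.Trim();
        phone = m.Groups["phone"].Value.Trim();
        return m.Success;
    }
}
```

Named groups keep the calling code readable when fields are added or reordered in the pattern, because nothing depends on a group's numeric position.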

Will
The substring/IndexOf() way would also be faster than building a complex regex.
Tomalak
A: 

We found that for spam filtering and other high-volume applications, regular expressions are a bit slow for parsing MIME headers, which is what you want to do. The code is somewhat specialized, but I wrote a C state machine for doing the parsing which is as fast as you'll get without going to something like re2c. The code is not for the faint of heart, but it is blindingly fast.

For emails I think you'll find an explicit state machine is easier to work with than regular expressions. It's also the last refuge of the goto statement!

Norman Ramsey
+2  A: 

Assuming that the parts in your email that are not bold always occur like that in all your emails, you can easily grab all the parts from your email with the regex:

Sig\./Sig\.ra :(.*)

Email: (.*)

Tel\.: (.*)

sta cercando un immobile con le seguenti caratteristiche:

Categoria: (.*)

Tipologia: (.*)

Tipo di contratto: (.*)

Comune: (.*)

Zona: (.*)

Fascia di prezzo: (.*)

In C#

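// Needs: using System.Text.RegularExpressions;
// \s+ between fields tolerates \r\n, \n and blank lines;
// [^\r\n]* keeps a stray \r out of each capture (in .NET, "." matches \r).
Regex regexObj = new Regex(
    @"Sig\./Sig\.ra :([^\r\n]*)\s+" +
    @"Email: ([^\r\n]*)\s+" +
    @"Tel\.: ([^\r\n]*)\s+" +
    @"sta cercando un immobile con le seguenti caratteristiche:\s+" +
    @"Categoria: ([^\r\n]*)\s+" +
    @"Tipologia: ([^\r\n]*)\s+" +
    @"Tipo di contratto: ([^\r\n]*)\s+" +
    @"Comune: ([^\r\n]*)\s+" +
    @"Zona: ([^\r\n]*)\s+" +
    @"Fascia di prezzo: ([^\r\n]*)");
Match matchObj = regexObj.Match(subjectString);
if (matchObj.Success)
{
    string Sig = matchObj.Groups[1].Value;
    string Email = matchObj.Groups[2].Value;
    // and so on to get all the other parts
}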
Jan Goyvaerts
A: 

You really don't want to do this manually, or with regular expressions. There are many different ways to encode data in an email, and many emails that don't strictly conform to the spec that can still be parsed. I have had success with AnPOP in a .NET environment.

Chase Seibert
+1  A: 

I think it would be much better to split this string into an array of lines. You can initialize a dictionary with all the titles as keys, then search each line for a title from the dictionary ("Email:", for example) and put the rest of the line back into the dictionary as the value. At the end you will have a dictionary with all the titles and their values. I don't think you need a regex for that. This way the order of the titles won't matter, either.
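Something along these lines, as a sketch only. The label list is copied from the sample email in the question, and the names `DictionaryParser` and `Parse` are invented for the example.

```csharp
using System;
using System.Collections.Generic;

static class DictionaryParser
{
    // Titles taken from the sample email; extend this list if other emails carry more fields.
    static readonly string[] Labels =
    {
        "Sig./Sig.ra :", "Email:", "Tel.:", "Categoria:", "Tipologia:",
        "Tipo di contratto:", "Comune:", "Zona:", "Fascia di prezzo:"
    };

    // Maps each title found in the body to the text that follows it on the same line.
    public static Dictionary<string, string> Parse(string body)
    {
        var result = new Dictionary<string, string>();
        string[] lines = body.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string rawLine in lines)
        {
            string line = rawLine.Trim();
            foreach (string label in Labels)
            {
                if (line.StartsWith(label, StringComparison.Ordinal))
                {
                    result[label] = line.Substring(label.Length).Trim();
                    break; // a line carries at most one title
                }
            }
        }
        return result;
    }
}
```

Titles missing from an email are simply absent from the result, so check with `TryGetValue` before loading a value into the database.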

Karim