We've got a scenario that requires us to parse lots of e-mail (plain text). Each e-mail 'type' is the result of a script being run against various platforms. Some are tab delimited, some are space delimited, and some we simply don't know yet.

We'll need to support more 'formats' in the future too.

Do we go for a solution using:

  • Regex
  • Simple string searching (using string.IndexOf etc.)
  • Lex/ Yacc
  • Other

The overall solution will be developed in C# 2.0 (hopefully 3.5).

+4  A: 

Regex.

Regex can solve almost everything except for world peace. Well maybe world peace too.

IainMH
I heard Regex was responsible for bringing down the Berlin Wall.
Robert Durgin
They should indeed stop using Nukes in disaster movies.
Coincoin
Regex: The cause of, and solution to, all of life's problems.
Joviee
A: 

Regex would probably be your best bet: tried and proven. Plus, a regular expression can be compiled.
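As a minimal sketch of what "compiled" means in .NET: passing RegexOptions.Compiled asks the runtime to compile the pattern to IL once, which tends to pay off when the same pattern is run against many e-mails. The "Key: value" line format and the class and method names below are assumptions made up for illustration, not something from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CompiledRegexSketch
{
    // Hypothetical format: one "Key: value" pair per line of the e-mail body.
    // The Regex is built once and reused for every e-mail.
    private static readonly Regex KeyValueLine = new Regex(
        @"^(?<key>[A-Za-z ]+):\s*(?<value>.+)$",
        RegexOptions.Compiled | RegexOptions.Multiline);

    // Returns the value for the given key, or null if the key is absent.
    public static string FindValue(string body, string key)
    {
        foreach (Match m in KeyValueLine.Matches(body))
        {
            if (m.Groups["key"].Value.Trim() == key)
            {
                // Trim() also drops a trailing '\r' left by CRLF line endings.
                return m.Groups["value"].Value.Trim();
            }
        }
        return null;
    }
}
```

Compiling costs more at startup, so it is probably only worth it for patterns that are reused heavily.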

Robert Durgin
+1  A: 

You probably should have a pluggable system regardless of which type of string parsing you use. So, this system calls upon the right 'plugin' depending on the type of email to parse it.

Vaibhav
A: 

With as little information as you provided, I would choose Regex.

But what kind of information you want to parse, and what you want to do with it, may change the decision to Lex/Yacc.

But it looks like you've already made your mind up with String search :)

Prakash
+4  A: 

The three solutions you stated each cover very different needs.

Manual parsing (simple text search) is the most flexible and the most adaptable; however, it very quickly becomes a real pain in the ass as the parsing required gets more complicated.

Regexes are a middle ground, and probably your best bet here. They are powerful yet flexible, since you can add more logic yourself in the code that calls the different regexes. The main drawback here would be speed.

Lex/Yacc is really only suited to very complicated, predictable syntaxes, and it lacks a lot of post-compile flexibility. You can't easily change parsers mid-parse. Well, actually you can, but it's just too heavy, and you'd be better off using regex instead.

I know this is a cliché answer, but it all really comes down to what your exact needs are. From what you said, I would personally probably go with a bag of regexes.

As an alternative, as Vaibhav pointed out, if several different situations can arise and you can easily detect which one is coming, you could make a plugin system that chooses the right algorithm. Those algorithms could all be very different, one using Lex/Yacc for the specialized cases and another using IndexOf and regex for the simpler ones.

Coincoin
A: 

Your best bet is RegEx because it provides a much greater degree of flexibility than any of the other options.

While you could use IndexOf to handle some things, you may quickly find yourself writing code that looks like:

if(s.IndexOf("search1")>-1 || s.IndexOf("search2")>-1 ||...

That can be handled in one RegEx statement. Plus, there are a lot of places like RegExLib.com where you can find folks who have shared regular expressions to solve problems.
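A rough sketch of that idea: the chain of IndexOf checks collapses into one alternation pattern. The class and method names are hypothetical; Regex.Escape is used so that search terms containing characters such as '.' are matched literally, which keeps the behavior close to IndexOf.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AlternationSketch
{
    // Builds a single pattern of the form term1|term2|... from literal terms.
    public static Regex BuildAnyOf(params string[] terms)
    {
        string[] escaped = new string[terms.Length];
        for (int i = 0; i < terms.Length; i++)
        {
            escaped[i] = Regex.Escape(terms[i]); // treat each term as plain text
        }
        return new Regex(string.Join("|", escaped));
    }

    // Equivalent to: s.IndexOf(t1) > -1 || s.IndexOf(t2) > -1 || ...
    public static bool ContainsAny(string s, params string[] terms)
    {
        return BuildAnyOf(terms).IsMatch(s);
    }
}
```

For example, AlternationSketch.ContainsAny(s, "search1", "search2") stands in for the two IndexOf calls shown earlier. In real code the Regex would probably be built once and cached rather than rebuilt on every call.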

Josef
+1  A: 

You must architect your solution to be updatable, so that you can handle unknown situations when they crop up. Create an interface for parsers that contains not only methods for parsing the emails and returning results in a standard format, but also for examining the email to determine if the parser will execute.

Within your configuration, identify the type of parser you wish to use, set its configuration options, and the configuration for the identifiers which determine if a parser will act or not. Name the parsers by assembly qualified name so that the types can be instantiated at runtime even if there aren't static links to their assemblies.
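Instantiating a type from an assembly qualified name can be sketched with Type.GetType and Activator.CreateInstance, both standard .NET calls. The factory class name is hypothetical, and the sketch assumes the parser type has a public parameterless constructor.

```csharp
using System;

public static class ParserFactorySketch
{
    // assemblyQualifiedName would come from configuration, e.g. something like
    // "MyCompany.Parsers.TabParser, MyCompany.Parsers" (a made-up name).
    public static object CreateFromTypeName(string assemblyQualifiedName)
    {
        // The second argument makes GetType throw if the type cannot be loaded,
        // instead of returning null and failing later.
        Type type = Type.GetType(assemblyQualifiedName, true);

        // Requires a public parameterless constructor on the target type.
        return Activator.CreateInstance(type);
    }
}
```

The caller would then cast the result to the parser interface and configure it from the same configuration section.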

Identifiers can implement an interface as well, so you can create different types that check for different things. For instance, you might create a regex identifier, which checks the email for a specific pattern. Make as much information as possible available to the identifier, so that it can make decisions based on things like the from address as well as the content of the email.
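One possible shape for the parser and identifier interfaces is sketched below. Every type and member name is hypothetical, and the tab-delimited parser is only a stand-in for a real format; it keeps to C# 2.0-style syntax.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Decides whether a given e-mail belongs to a particular parser.
public interface IEmailIdentifier
{
    bool Matches(string from, string subject, string body);
}

// Parses an e-mail body into a standard key/value result.
public interface IEmailParser
{
    IEmailIdentifier Identifier { get; }
    IDictionary<string, string> Parse(string body);
}

// An identifier that looks for a pattern in the body.
public class RegexIdentifier : IEmailIdentifier
{
    private readonly Regex pattern;

    public RegexIdentifier(string pattern) { this.pattern = new Regex(pattern); }

    public bool Matches(string from, string subject, string body)
    {
        return pattern.IsMatch(body);
    }
}

// Stand-in parser: treats each line as "key<TAB>value".
public class TabDelimitedParser : IEmailParser
{
    private readonly IEmailIdentifier identifier = new RegexIdentifier(@"\t");

    public IEmailIdentifier Identifier { get { return identifier; } }

    public IDictionary<string, string> Parse(string body)
    {
        Dictionary<string, string> result = new Dictionary<string, string>();
        string[] lines = body.Split(new char[] { '\r', '\n' },
                                    StringSplitOptions.RemoveEmptyEntries);
        foreach (string line in lines)
        {
            string[] cells = line.Split('\t');
            if (cells.Length >= 2) result[cells[0]] = cells[1];
        }
        return result;
    }
}

public static class ParserDispatcher
{
    // Returns the first parser whose identifier accepts the e-mail, or null.
    public static IEmailParser Choose(IEnumerable<IEmailParser> parsers,
                                      string from, string subject, string body)
    {
        foreach (IEmailParser parser in parsers)
        {
            if (parser.Identifier.Matches(from, subject, body)) return parser;
        }
        return null;
    }
}
```

A null result from Choose is the signal that no known parser handles the e-mail, which is the case where a new parser and identifier pair would be written.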

When your known parsers can't handle a job, create a new DLL with types that implement the parser and identifier interfaces that can handle the job and drop them in your bin directory.

Will
+1  A: 

It depends on what you're parsing. For anything beyond what Regex can handle, I've been using ANTLR. If this is your first time with recursive descent parsing, I would research how such parsers work before attempting to use a framework like this one. If you subscribe to MSDN Magazine, check the Feb 2008 issue, which has an article on writing one from scratch.

Once you understand the basics, learning ANTLR will be a ton easier. There are other frameworks out there, but ANTLR seems to have the most community support and public documentation. The author has also published The Definitive ANTLR Reference: Building Domain-Specific Languages.

spoulson
A: 

@Coincoin has covered the bases; I just want to add that with regex it's particularly easy to end up with hard-to-read, hard-to-maintain code. Regex is a powerful and very compact language, so that's how it often goes.

Using whitespace and comments within the regex can go a long way to make it easier to maintain regexes. Eric Gunnerson turned me on to this idea. Here's an example.
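To illustrate the technique (this is not Eric Gunnerson's example, just a sketch): in .NET, RegexOptions.IgnorePatternWhitespace makes the engine skip unescaped whitespace in the pattern and treat '#' as the start of a comment. The status-line format and the names below are assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommentedRegexSketch
{
    // Hypothetical line format: "2008-09-16 web01 OK"
    private static readonly Regex StatusLine = new Regex(@"
        ^
        (?<date>\d{4}-\d{2}-\d{2})   # ISO-style date
        \s+
        (?<host>\S+)                 # host name, no spaces allowed
        \s+
        (?<status>OK|FAIL)           # result reported by the script
        $",
        RegexOptions.IgnorePatternWhitespace);

    public static Match Parse(string line)
    {
        return StatusLine.Match(line);
    }
}
```

Named groups help too: the calling code reads Groups["host"] rather than a numbered group, so the pattern can be reordered without breaking callers.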

Jay Bazuzi
A: 

Use PCRE. All other answers are just 2nd Best.

Geek
Can you add a reason?
Kieron
It lets you do different types of searches: text, regex, etc. It is a compiled library that lets you do so many things on so many platforms, and it has been tested for years. It will probably be much faster than the implementation you would write.
Geek