TLDR: I built a multipurpose parser by hand, with different code for each format. Would it work better in the long run to use one chunk of parser code plus an ANTLR, PyParsing or similar grammar to specify each format?

Context: My job involves lots of benchmark log files from ~50 different benchmarks. There are a few in XML, a few HTML, a few CSV and lots of proprietary stuff with no documented spec. To save me and my coworkers the time of entering this data by hand, I wrote a parsing tool that handles all of the formats we deal with regularly with a uniform interface. The design, though, is not so clean.

I wrote this thing in Python and created a Parser class. Each file format is handled as an implementation that provides its own code for the Parser's read() method. I like the idea of having only one definition of Parser that uses grammars to understand each format, but I've never done it before.
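For readers trying to picture the current design, here is a minimal sketch of what "one Parser class, one read() implementation per format" might look like. The two formats, the subclass names, and the (name, score) return shape are all assumptions made up for illustration; they are not the actual tool.

```python
import csv
import io


class Parser:
    """Uniform interface: every format returns a list of (name, score) pairs."""

    def read(self, text):
        raise NotImplementedError


class CsvParser(Parser):
    # Hypothetical format: one "benchmark,score" row per line.
    def read(self, text):
        rows = csv.reader(io.StringIO(text))
        return [(name, float(score)) for name, score in rows]


class KeyValueParser(Parser):
    # Hypothetical proprietary format: "benchmark = score" on each result line.
    def read(self, text):
        results = []
        for line in text.splitlines():
            if "=" not in line:
                continue  # skip headers, banners, and anything else
            name, _, score = line.partition("=")
            results.append((name.strip(), float(score)))
        return results
```

Every new format means another subclass with its own hand-written line walking, which is the part a grammar-driven design would replace.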

Is it worth my time, and will it be easier for other newbies to work with in the future once I finish refactoring?

+2  A: 

I can't answer your question with 100% certainty, but I can give you an opinion.

I find the choice between a proper grammar and hand-rolled regex "parsers" often comes down to how uniform the input is.

If the input is very uniform and you already know a language that deals with strings well, like Python or Perl, then I'd keep your existing code.

On the other hand, I find parser generators like ANTLR really shine when the input can have errors and inconsistencies in it. The reason is that a formal grammar lets you focus on what should be matched in a given context without having to worry about walking the input stream manually.
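As a rough illustration of describing what to match instead of walking the input, here is a small sketch using PyParsing (one of the tools named in the question, not ANTLR). It assumes pyparsing 3.x is installed and invents a result-line format of "name: number unit"; the grammar and the read() helper are hypothetical.

```python
from pyparsing import Suppress, Word, alphanums, alphas, pyparsing_common

# Hypothetical result line: "<benchmark name>: <number> <unit>"
name = Word(alphas, alphanums + "_")("name")
value = pyparsing_common.number("value")  # converts to int or float
unit = Word(alphas + "/")("unit")
result_line = name + Suppress(":") + value + unit


def read(text):
    """Return (name, value, unit) for every result line found in text."""
    return [
        (tokens["name"], tokens["value"], tokens["unit"])
        # scan_string searches the whole text and skips whatever does not match
        for tokens, _start, _end in result_line.scan_string(text)
    ]


log = "=== run 7 ===\nfft: 12.5 ms\ngarbage ###\nsort: 300 ops/s\n"
print(read(log))
```

The grammar is just data attached to names, so supporting another format would mean writing another small grammar rather than another read() loop.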

Furthermore, if the input stream has errors, I find they're often easier to deal with using ANTLR than with regexes. The reason is that when a couple of options are available, ANTLR has built-in functionality for choosing the correct path, including rollback via predicates.

Having said all that, there is a lot to be said for working code. If I want to rewrite something, I try to make a good case for how the rewrite will benefit the users of the product.

chollida
Thanks for your response. The input is very much not uniform. The benchmarks we run rev all the time, with very little guarantee that the new results format will look like the old one. Files can be mangled, partial, or made up of multiple batches of results concatenated together. I'm giving PyParsing a shot. I balked at Python parsing tools at first because I thought they'd have a steeper learning curve for most users than something more EBNF-like. Then I thought a little more about my coworkers and realized they're newer to scripting in general, so Python wouldn't be more of a barrier. Going well so far.
altie