tags:

views: 355

answers: 8

I have been programming since 1999 for work and fun. I want to learn new things, and lately I've been focused on parsing, as a large part of my job is reading, integrating and analyzing data. I also have a large number of repetitive tasks that I think I could express in very simple domain-specific languages if the overhead was low enough. I have a few questions about the subject.

  1. Most of my current parsing code doesn't define a formal grammar. I usually hack something together in my language of choice because that's easy, I know how to do it and I can write that code very fast. It's also easy for other people I work with to maintain. What are the advantages and disadvantages of defining a grammar and generating a real parser (as one would do with ANTLR or YACC), compared with the ad-hoc hacks most programmers use to parse things?
  2. What are the best parser generation tools for writing grammar-based parsers in C++, Perl and Ruby? I've looked at ANTLR and haven't found much about using ANTLRv3 with a C++ target, but otherwise that looks interesting. What are the other tools that are similar to ANTLR that I should be reading about?
  3. What are the canonical books and articles that someone interested in learning more about parsing should read? A course in compilers unfortunately wasn't part of my education, so basic material is very welcome. I've heard great things about the Dragon Book, but what else is out there?
+2  A: 

Let's Build A Compiler is a step-by-step tutorial on how to write a simple compiler. The code is written in Delphi (Pascal), but it's basic enough to easily translate into most other languages.

Chris R. Timmons
Funny, I was actually going to recommend the exact same thing, but couldn't remember what it was called. +1
musicfreak
+1  A: 

In Perl, the Parse::RecDescent module is the first place to start. Add "tutorial" to the module name and Google should be able to find plenty of tutorials to get you started.

Marius Kjeldahl
+3  A: 

Here's my take on your (very good) questions:

  1. I think a generated parser pays off most in non-trivial situations where a grammar actually exists. You have to know how parsers and grammars work to think of that technique, and not every developer does.
  2. lex/yacc are older Unix tools that might be usable for you as a C++ developer. Maybe Bison as well.
  3. ANTLR and its attendant book are very good. "Writing Compilers and Interpreters" has C++ examples which you might like.

The GoF Interpreter pattern is another technique for writing "little languages". Take a look at that.
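To make that concrete, here is a minimal sketch of the Interpreter pattern in Python (a hypothetical example, not taken from the GoF book): each grammar rule of the little language becomes a class with an `interpret()` method, and a sentence in the language is a tree of such objects.

```python
# Interpreter pattern sketch: each rule of a tiny expression language
# is a class; evaluation walks the object tree.

class Number:
    """Terminal rule: a literal number."""
    def __init__(self, value):
        self.value = value
    def interpret(self, context):
        return self.value

class Variable:
    """Terminal rule: a variable looked up in the context."""
    def __init__(self, name):
        self.name = name
    def interpret(self, context):
        return context[self.name]

class Add:
    """Nonterminal rule: addition of two sub-expressions."""
    def __init__(self, left, right):
        self.left, self.right = left, right
    def interpret(self, context):
        return self.left.interpret(context) + self.right.interpret(context)

# "x + 3" as an expression tree, evaluated with x = 4
expr = Add(Variable("x"), Number(3))
print(expr.interpret({"x": 4}))  # 7
```

The pattern trades parsing machinery for an object tree you build directly in the host language, which is often enough for very small DSLs.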

duffymo
+3  A: 

On 1., I would say the main advantage is maintainability -- making a little change to the language just means making a correspondingly small change to the grammar, rather than minutely hacking through the various spots in the code that may have something to do with what you want changed. That's orders of magnitude better productivity and a smaller risk of bugs.

On 2. and 3., I can't suggest much beyond what you already found (I mostly use Python and pyparsing, and could comment from experience on many Python-centered parse frameworks, but for C++ I mostly use good old yacc or bison anyway, and my old gnarled copy of the Dragon Book -- not the latest edition, actually -- is all I keep at my side for the purpose...).

Alex Martelli
+1  A: 

Defining a grammar using BNF, EBNF or something similar is easier, and later on you will have a better time maintaining it. Also, you can find a lot of examples of grammar definitions. Last but not least, if you are going to talk about your grammar with someone else in the field, it is better if you are both speaking the same language (BNF, EBNF etc.).

Writing your own parsing code is like reinventing the wheel and is prone to errors. It is also less maintainable. Of course, it can be more flexible, and for small projects it might also be a good choice, but using an existing parser generator that takes a grammar and spits out the code should cover most of your needs.
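As a rough illustration of the trade-off, here is how a tiny hypothetical EBNF grammar maps onto hand-written recursive descent in Python -- one function per rule. This is the mapping a generator like yacc automates from the grammar alone:

```python
# Grammar (hypothetical):
#   expr ::= term { "+" term }
#   term ::= NUMBER
import re

def tokenize(s):
    """Split the input into number and '+' tokens."""
    return re.findall(r"\d+|\+", s)

def parse_expr(tokens):
    # expr ::= term { "+" term }  -- evaluate as we parse
    value = parse_term(tokens)
    while tokens and tokens[0] == "+":
        tokens.pop(0)             # consume "+"
        value += parse_term(tokens)
    return value

def parse_term(tokens):
    # term ::= NUMBER
    return int(tokens.pop(0))

print(parse_expr(tokenize("1+2+3")))  # 6
```

Hand-written, every grammar change means editing the corresponding function; with a generator, you change only the grammar file.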

For C++ I would also suggest lex/yacc. For Ruby this looks like a decent choice: Coco/R(uby)

Petros
+1  A: 

Funny timing: I spent lots of this morning wondering about state machines and parsers, and trying to figure out how I could learn more about them.

For 2, you might take a look at Ragel (it's good for C++ and Ruby).

Telemachus
I forgot to mention in the question that I spent last night working on a machine to read your mind. :) Thanks for the Ragel suggestion, I'll definitely look at it!
James Thompson
+2  A: 

I would have a serious look at monadic combinator-based parsing (which often also deals with lexical analysis) in Haskell. I found it quite an eye opener; it's amazing how easily you can build a parser from scratch using this method. It's so easy, in fact, that it's often faster to write your own parser than it is to try to use existing libraries.

The most famous example is probably Parsec which has a good user guide that explains how to use it. There is a list of ports of this library to other languages (including C++ and Ruby) listed on the Parsec page of the Haskell wiki, though I'm not familiar with them and so I can't say how close they are to using Parsec in Haskell.

If you want to learn how these work internally and how to write your own, I recommend starting with Chapter 8 ("Functional Parsers") from Graham Hutton's Programming in Haskell. Once you understand that chapter well (which will probably take several readings), you'll be set.
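For a feel of the idea outside Haskell, here is a rough Python sketch of the combinator approach (the names `char`, `many1` and `choice` are borrowed from Parsec's vocabulary, but this is a hypothetical toy, not one of the ports mentioned above). A parser is just a function from an input string to either `(result, remaining_input)` or `None` on failure, and combinators build bigger parsers from smaller ones:

```python
# Minimal parser-combinator sketch: parsers are functions
# string -> (result, rest) or None.

def char(c):
    """Parser that matches a single expected character."""
    def parse(s):
        return (c, s[1:]) if s.startswith(c) else None
    return parse

def choice(*parsers):
    """Try each parser in order, return the first success."""
    def parse(s):
        for p in parsers:
            out = p(s)
            if out is not None:
                return out
        return None
    return parse

def many1(p):
    """Apply p one or more times, collecting the results."""
    def parse(s):
        out = p(s)
        if out is None:
            return None
        results = []
        while out is not None:
            value, s = out
            results.append(value)
            out = p(s)
        return (results, s)
    return parse

digit = choice(*[char(d) for d in "0123456789"])
number = many1(digit)  # one or more digits

print(number("42abc"))  # (['4', '2'], 'abc')
```

The Haskell version is cleaner because the monad threads the remaining input and failure for you, but the composition-of-small-parsers idea is the same.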

Curt Sampson
+1  A: 

Here's a tutorial on a self-contained (10 pages!), completely portable compiler-compiler which can be used to design and implement "low overhead" DSLs very quickly:

http://www.bayfronttechnologies.com/mc_tutorial.html

This site walks you through Val Schorre's 1964 paper on MetaII. Yes, 1964. And it is amazing. This is how I learned about compilers back in 1970.

Ira Baxter