Hi,

I would like to do some parsing and tokenizing in C++ for learning purposes. I often come across bison/yacc and lex when reading about this subject online. Would there be any major benefit to using those over, for instance, a tokenizer/parser written using the STL or boost::regex, or maybe even just C?

+2  A: 

Somebody else has already written and DEBUGGED them for you?

Martin Beckett
+7  A: 

I recently undertook writing a simple lexer and parser.

It turned out that the lexer was simpler to code by hand. But the parser was a little more difficult. My Bison-generated parser worked almost right off the bat, and it gave me a lot of helpful messages about where I had forgotten about states. I later wrote the same parser by hand but it took a lot more debugging before I had it working perfectly.
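To give an idea of why a lexer is manageable by hand, here is a minimal sketch in C++. It is not the lexer described above; the token kinds and the names `TokenKind`, `Token` and `tokenize` are made up for illustration, and it assumes a tiny language of integers, identifiers and single-character symbols.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token kinds for a tiny illustrative language.
enum class TokenKind { Number, Identifier, Symbol };

struct Token {
    TokenKind kind;
    std::string text;
};

// Splits the input into tokens. Whitespace is skipped, runs of digits
// become Number tokens, runs of letters/digits/underscores that start
// with a letter or underscore become Identifier tokens, and any other
// single character becomes a Symbol token.
std::vector<Token> tokenize(const std::string& input) {
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < input.size()) {
        const unsigned char c = static_cast<unsigned char>(input[i]);
        if (std::isspace(c)) {
            ++i;
            continue;
        }
        const std::size_t start = i;
        if (std::isdigit(c)) {
            while (i < input.size() &&
                   std::isdigit(static_cast<unsigned char>(input[i]))) {
                ++i;
            }
            tokens.push_back({TokenKind::Number, input.substr(start, i - start)});
        } else if (std::isalpha(c) || c == '_') {
            while (i < input.size() &&
                   (std::isalnum(static_cast<unsigned char>(input[i])) ||
                    input[i] == '_')) {
                ++i;
            }
            tokens.push_back({TokenKind::Identifier, input.substr(start, i - start)});
        } else {
            ++i;  // consume exactly one character
            tokens.push_back({TokenKind::Symbol, std::string(1, input[start])});
        }
    }
    return tokens;
}
```

Calling `tokenize("x1 = 42 + y")` yields five tokens: `x1`, `=`, `42`, `+`, `y`. A real lexer would also need multi-character operators, string literals, comments and source positions, which is where a lex specification starts to save typing.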

The appeal of generating tools for lexers and parsers is that you can write the specification in a clean, easy-to-read language that comes close to being a shortest-possible rendition of your spec. A hand-written parser is usually at least twice as big. Also, the automated parser (/lexer) comes with a lot of diagnostic code and logic to help you get the thing debugged.

A parser/lexer spec in BNF-like language is also a lot easier to change, should your language or requirements change. If you're dealing with a hand-written parser/lexer, you may need to dig deeply into your code and make significant changes.

Finally, because they're often implemented as finite state machines without backtracking (gazillions of options on Bison, so this is not always a given), it's quite possible that your auto-generated code will be more efficient than your hand-coded product.

Carl Smotricz
Thanks for your detailed answer. I guess I will try both just for comparison, since it's just for fun anyway!
+1  A: 

It's easier and they are more general. Bison/Lex can tokenize and parse arbitrary grammar and present it in what may be an easier format. They might be faster as well, depending on how well you write your regex.

I wouldn't want to write my own parser in C since the language doesn't have great intuition about strings. If you write your own, I would recommend Perl for ease of regex (or possibly Python).

It is probably faster to use existing tools, but it may or may not be as much fun. If you have time and since it is just for learning, go for it. C++ is a good language to start with.

Adam Shiemke
They certainly cannot parse "arbitrary grammar".
anon
A: 

Different strokes for different folks. I personally like recursive descent parsers - I find them easy to understand and you can make them produce superior end-user error messages to those produced by tools like bison.
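As a rough illustration of the style, here is a small recursive descent evaluator for integer arithmetic in C++. The grammar, the class name `ExprParser` and the wording of the error messages are all made up for this sketch; the point is only that each grammar rule becomes one function, and each function knows exactly what it expected when it fails.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Illustrative grammar (an assumption, not taken from any real language):
//   expr   := term   (('+' | '-') term)*
//   term   := factor (('*' | '/') factor)*
//   factor := NUMBER | '(' expr ')'
class ExprParser {
public:
    explicit ExprParser(std::string text) : text_(std::move(text)) {}

    // Parses and evaluates the whole input; throws std::runtime_error
    // with a position-tagged message on a syntax error.
    long parse() {
        const long value = expr();
        skipSpaces();
        if (pos_ != text_.size()) fail("unexpected trailing input");
        return value;
    }

private:
    [[noreturn]] void fail(const std::string& message) const {
        throw std::runtime_error(message + " at position " + std::to_string(pos_));
    }

    void skipSpaces() {
        while (pos_ < text_.size() &&
               std::isspace(static_cast<unsigned char>(text_[pos_]))) {
            ++pos_;
        }
    }

    // Consumes the character c if it is next (after whitespace).
    bool accept(char c) {
        skipSpaces();
        if (pos_ < text_.size() && text_[pos_] == c) {
            ++pos_;
            return true;
        }
        return false;
    }

    long expr() {
        long value = term();
        for (;;) {
            if (accept('+')) value += term();
            else if (accept('-')) value -= term();
            else return value;
        }
    }

    long term() {
        long value = factor();
        for (;;) {
            if (accept('*')) {
                value *= factor();
            } else if (accept('/')) {
                const long divisor = factor();
                if (divisor == 0) fail("division by zero");
                value /= divisor;
            } else {
                return value;
            }
        }
    }

    long factor() {
        if (accept('(')) {
            const long value = expr();
            if (!accept(')')) fail("expected ')'");
            return value;
        }
        skipSpaces();
        if (pos_ < text_.size() &&
            std::isdigit(static_cast<unsigned char>(text_[pos_]))) {
            long value = 0;
            while (pos_ < text_.size() &&
                   std::isdigit(static_cast<unsigned char>(text_[pos_]))) {
                value = value * 10 + (text_[pos_] - '0');
                ++pos_;
            }
            return value;
        }
        fail("expected a number or '('");
    }

    std::string text_;
    std::size_t pos_ = 0;
};
```

With this sketch, `ExprParser("1 + 2 * (3 + 4)").parse()` returns 15, while `ExprParser("1 + * 2").parse()` throws with the message `expected a number or '(' at position 4`. Tailoring that message per rule is the kind of control that is harder to get out of a generated parser.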

anon
I also find that they tend to be a bit more robust in the face of the tricky edge cases in some languages where parsing and lexing overlap.
Kylotan