views: 114

answers: 5

I am working on a new programming language, and I have always been puzzled by the fact that everyone else seems to use yacc/lex to parse code, while I do not.

My compiler (which already works) is hand-coded in C++/STL, and I cannot say it is complex or took too much time. It has both a lexer and a parser of sorts, but they are not auto-generated.

Earlier, I wrote a C compiler (not the full spec) the same way. It could compile a program in a single pass, resolving all the back references and doing the preprocessing as it went; this is definitely impossible with yacc/lex.

I just cannot convince myself to scrap all this and start diving into yacc/lex, which would take quite an effort to adopt and might introduce some grammar limitations.

Is there something I am missing by not using yacc/lex? Am I doing an evil thing?

+1  A: 

Maybe you are missing out on ANTLR, which generates recursive-descent parsers and is a good fit for languages that can be parsed with that strategy.
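
For a sense of what that strategy looks like when written by hand, here is a minimal sketch of a recursive-descent parser (a toy expression grammar invented for illustration, not the OP's language): one function per grammar rule, with operator precedence encoded by which function calls which.

```cpp
// Minimal recursive-descent parser for a toy grammar (illustration only):
//   expr   -> term   (('+' | '-') term)*
//   term   -> factor (('*' | '/') factor)*
//   factor -> NUMBER | '(' expr ')'
#include <cctype>
#include <stdexcept>
#include <string>

class ExprParser {
public:
    explicit ExprParser(std::string s) : src_(std::move(s)) {}

    double parse() {
        double v = expr();
        if (peek() != '\0') throw std::runtime_error("trailing input");
        return v;
    }

private:
    std::string src_;
    size_t pos_ = 0;

    // Skip whitespace and return the next character without consuming it.
    char peek() {
        while (pos_ < src_.size() && std::isspace((unsigned char)src_[pos_])) ++pos_;
        return pos_ < src_.size() ? src_[pos_] : '\0';
    }

    // Each nonterminal becomes one function; precedence falls out of the call graph.
    double expr() {
        double v = term();
        for (char c = peek(); c == '+' || c == '-'; c = peek()) {
            ++pos_;
            double r = term();
            v = (c == '+') ? v + r : v - r;
        }
        return v;
    }

    double term() {
        double v = factor();
        for (char c = peek(); c == '*' || c == '/'; c = peek()) {
            ++pos_;
            double r = factor();
            v = (c == '*') ? v * r : v / r;
        }
        return v;
    }

    double factor() {
        if (peek() == '(') {
            ++pos_;
            double v = expr();
            if (peek() != ')') throw std::runtime_error("expected ')'");
            ++pos_;
            return v;
        }
        size_t start = pos_;
        while (pos_ < src_.size() &&
               (std::isdigit((unsigned char)src_[pos_]) || src_[pos_] == '.'))
            ++pos_;
        if (start == pos_) throw std::runtime_error("expected number");
        return std::stod(src_.substr(start, pos_ - start));
    }
};
```

`ExprParser("1 + 2*(3-1)").parse()` returns 5; adding a new operator means adding one branch (or one function) in the obvious place, which is part of why hand-written parsers of this style stay manageable.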

There are advantages to using Yacc/Lex, but it is not mandatory to use them. There are downsides too, but the advantages usually outweigh the disadvantages: in particular, it is often easier to maintain a Yacc-driven grammar than a hand-coded one, and you benefit from the automation that Yacc provides.

However, writing your own parser from scratch is not an evil thing to do. It may make the compiler harder to maintain in the future, but it may equally make it easier.

Jonathan Leffler
+3  A: 

The main advantage of using any kind of lexer/parser generator is the flexibility it gives you when your language evolves. In a hand-coded lexer/parser (especially if you have mixed a lot of other functionality into a single pass!), changes to the language get nasty fairly quickly, whereas with a parser generator you make the change, re-run the generator, and move on with your life. There are certainly no inherent technical limitations to writing everything by hand, but I think the evolvability and maintainability you gain by automating away the boring bits are worth it!

Gian
Good, modular design (for example, keeping the lexical analyser and the parser completely separated) can make this less of an issue. Still important, though.
delnan
+2  A: 

The other huge advantage of using generators is that they are guaranteed to process exactly and only the language you specified in the grammar. You can't say that of any hand-written code. The LR/LALR variants are also guaranteed to be O(N), which again you can't assert about any hand coding, at least not without a lot of effort in constructing the proof.
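
To see why linear time is not "obvious" for hand-rolled parsing, here is a hypothetical sketch (invented for illustration) of an innocent-looking backtracking parser: nothing in the code looks expensive, yet on a pathological input the number of calls grows exponentially.

```cpp
// Backtracking parser for the toy grammar (illustration only):
//   A -> '(' A ')'  |  '(' A  |  (empty)
// On an input of N '(' characters with no ')', each position tries both
// bracketed alternatives and re-parses the same suffix twice, giving
// roughly 2^N calls. An LR/LALR table-driven parser for an equivalent
// grammar would be O(N).
#include <cstddef>
#include <iostream>
#include <string>

// Returns the index just past the parsed A; counts every invocation.
size_t parseA(const std::string& in, size_t pos, size_t& calls) {
    ++calls;
    if (pos < in.size() && in[pos] == '(') {
        // Alternative 1: '(' A ')'
        size_t r = parseA(in, pos + 1, calls);
        if (r < in.size() && in[r] == ')')
            return r + 1;
        // Alternative 2: '(' A  (backtrack: re-parse the same suffix)
        return parseA(in, pos + 1, calls);
    }
    return pos; // Alternative 3: the empty production always succeeds.
}

int main() {
    for (size_t n = 10; n <= 25; n += 5) {
        size_t calls = 0;
        parseA(std::string(n, '('), 0, calls);
        std::cout << n << " chars -> " << calls << " calls\n"; // ~2^(n+1)
    }
}
```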

I've written both and lived with both, and I would never hand-code again. I only did so because I didn't have yacc on the platform at the time.

EJP
Well, I do not need a mathematical proof to know that my code is fast; usually it is obvious that it is somewhere around O(N) or O(N log N).
BarsMonster
I didn't say anything about 'fast'. I am talking about algorithmic complexity, and you need a proof for that.
EJP
Well, when I write code, I always know its complexity. If a tool were writing the code for me, then I would definitely need a proof.
BarsMonster
You've got it back to front. LR parsing has an O(N) proof by Don Knuth (1965), and LALR parsing likewise by Frank DeRemer (1969). Where's yours? You can't just 'know' the algorithmic complexity of something as complex as a hand-written compiler.
EJP
Okay, so he doesn't know for sure; so what? "Compiles in O(N)" is not much of a selling point. If it's "fast enough" for average projects, nobody will care.
delnan
If this is a marketing discussion, it's news to me. What I am discussing here are the advantages of compiler generators, which is what the OP was asking about. Accepting exactly the correct language and O(N) complexity are important whether you can market them directly or not, just as sunk cost, maintainability, extensibility, and time to market are, although you don't market those either. Otherwise, what exactly were Don Knuth and Frank DeRemer doing?
EJP
A: 

It certainly depends on the complexity of your language's grammar. A simple grammar means a simple implementation, and you can just do it yourself.

Take a look at perhaps the worst possible example of all: C++ :) (Does anybody know another language, natural languages aside, that is more difficult to parse correctly?) Even with tools like ANTLR it is quite difficult to get right, though it is manageable. On the other hand, even though it is much harder, some of the best C++ parsers, e.g. those in GCC and Clang/LLVM, are mostly hand-written.
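
One classic illustration of why (my example, not the answerer's): in C++ the same token sequence can be a declaration or an expression depending on what was declared earlier, so the parser has to consult the symbol table while it parses rather than work from the grammar alone.

```cpp
// "A * b;" is a declaration when A names a type and an expression when A
// names a variable; the token stream alone cannot decide between them.
struct A {};        // A is a type here
int x = 1, y = 2;   // x and y are variables

int main() {
    A * b;          // declaration: b is a pointer to A
    x * y;          // expression statement: multiply x by y, discard result
    (void)b;        // silence the unused-variable warning
    return 0;
}
```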

If you don't need too much flexibility and your language is not entirely trivial, you will certainly save some work/time by using ANTLR.

Albert
The complexity of my language is just a little less than that of C++.
BarsMonster
Classic Fortran is pretty hard to parse: the assignment "`DO10I=12.43`" and the DO loop "`DO10I=12,43`" cannot even be split up into words accurately until you get to the comma vs full stop. (The first involves a floating point variable `DO10I`, the second involves an integer variable `I` and a label `10`: `DO 10 I = 12, 43`.) This tends to make it rather hard!
Jonathan Leffler
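
To make the comment above concrete, here is a hypothetical C++ sketch of the scan-ahead such a lexer needs (a deliberate simplification: real fixed-form Fortran must also ignore commas nested inside parentheses):

```cpp
// Classic FORTRAN ignores blanks and has no reserved words, so
// "DO10I=12,43" (a DO loop) and "DO10I=12.43" (an assignment to the
// variable DO10I) cannot be told apart until the comma or full stop
// is seen. A lexer must therefore scan ahead before tokenizing.
#include <cassert>
#include <string>

bool isDoStatement(const std::string& stmt) {
    std::string s;
    for (char c : stmt)             // blanks are insignificant: strip them
        if (c != ' ') s += c;
    if (s.rfind("DO", 0) != 0)      // must begin with DO
        return false;
    size_t eq = s.find('=');
    if (eq == std::string::npos)    // a DO statement needs "var = start, end"
        return false;
    // A comma after '=' marks a DO loop; otherwise we were really lexing
    // an assignment to a variable whose name happens to start with DO.
    return s.find(',', eq) != std::string::npos;
}

int main() {
    assert( isDoStatement("DO 10 I = 12, 43"));  // DO loop
    assert(!isDoStatement("DO10I = 12.43"));     // assignment to DO10I
    return 0;
}
```
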
+1  A: 

Yacc is inflexible in some ways:

  • good error handling is hard (basically, its algorithm is only defined to parse a correct string correctly, otherwise, all bets are off; this is one of the reasons that GCC moved to a hand-written parser)
  • context-dependency is hard to express, whereas with a hand-written recursive-descent parser you can simply add a parameter to the parsing functions (see the sketch after this list)
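
A hypothetical sketch of that second point: the context travels down the call chain as an ordinary parameter, here a flag saying whether we are inside a loop, so that "break" can be accepted there and rejected elsewhere. This is trivial by hand and awkward to express in a pure yacc grammar.

```cpp
// Context-dependency via a plain function parameter in a hand-written
// recursive-descent parser (toy token-based grammar, illustration only).
#include <stdexcept>
#include <string>
#include <vector>

class StmtParser {
public:
    explicit StmtParser(std::vector<std::string> toks) : toks_(std::move(toks)) {}

    void parseProgram() {
        while (pos_ < toks_.size())
            parseStatement(/*inLoop=*/false);  // "break" is illegal at top level
    }

private:
    std::vector<std::string> toks_;
    size_t pos_ = 0;

    const std::string& peek() const {
        static const std::string eof = "<eof>";
        return pos_ < toks_.size() ? toks_[pos_] : eof;
    }

    void parseStatement(bool inLoop) {
        if (peek() == "while") {
            ++pos_;                    // consume "while" (condition elided)
            parseStatement(true);      // inside the body, "break" is legal
        } else if (peek() == "break") {
            if (!inLoop)
                throw std::runtime_error("'break' outside of a loop");
            ++pos_;
        } else {
            ++pos_;                    // treat any other token as a statement
        }
    }
};

int main() {
    StmtParser ok({"while", "break"});
    ok.parseProgram();                 // accepted

    StmtParser bad({"break"});
    try {
        bad.parseProgram();            // rejected: "break" at top level
    } catch (const std::runtime_error&) { /* expected */ }
    return 0;
}
```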

Furthermore, I have noticed that lex/yacc object code is often bigger than that of a hand-written recursive-descent parser (source code tends to be the other way round).

I have not used ANTLR so I cannot say if that is better at these points.

jilles