views:

223

answers:

1

I was enjoying using YARD a lot:

http://www.ootl.org/yard/

http://code.google.com/p/yardparser/

http://www.codeproject.com/KB/recipes/yard-tokenizer.aspx

I was able to construct fully functional calculator. I'm evaluating YARD to do PHP parser. Please kindly advise on limitations of PEG grammar & parser generators. Thank you very much!

+5  A: 

I think the big "problem" with PEGs is that they don't fit into the normal taxonomy of grammars as they operate in a fundamentally different way. Normal grammars are "backwards" in the sense that they describe all the possible sentences (programs) that can be generated. PEGs describe how to parse--they come at the problem from the other end.

In my view this is a more natural way to think about the problem, and certainly for any hand-written (recursive-descent) parser I wouldn't do anything else.

DrPizza
thanks DrPizza! I read that PEG cannot parse Python and C++ in context sensitive parts. Not sure this is true. I'm trying to write PHP parser and found PEG solutions very easy, compared to those Bison/Yacc.
Viet
Most parsers can't properly deal with context-sensitive grammars without hacks of some kind (e.g. for parsing C you make the parser feed back into the lexer so that it assigns the right symbol type to type names, so that they don't get treated as regular identifiers).PEGs are interesting because they can directly express the disambiguation rules that C and C++ use (I don't know about Python). Specifically, "if it looks like a declaration, it is". They can do this by ordering their rules so that the declaration rule is tried before the statement rule.
DrPizza
That said, I don't know if PEGs can handle the entire C++ or Python grammars; I've not tried.
DrPizza
+1 Thanks again for your clarification.
Viet
Ordering rules doesn't help, if the meaning of the parse is determined by other information. C++ infamously allows "x*y;" as a statement, with two interpretations: a declaration, or an arithmetic operation. No rule ordering will help you decide what this is. You need context information. C and C++ parsers often hack this by building a symbol table as they go; knowing that x is a type resolves the problem. But if the definition of x or y appears *after* the statement, even this hack doesn't work. The safe bet is GLR parsers, which simply pick up both parses to resolve later.
Ira Baxter
+1 Nice example Ira! So I conclude that PEG are not suitable/optimal for C/C++ parsing. What about using it for PHP?
Viet
Ah yes, of course. The solution as ever is to have actions associated with rules; Boost's Spirit2 parser framework uses PEGs as its underlying model, and I believe it allows suitable actions--every successful type declaration should add its identifiers to a table of type names. In conjunction with PEG ordering (declaration rules are tried before expression rules), the PEG will do the right thing. The Spirit2 source is unfortunately the usual impenetrable stuff we expect from Boost.
DrPizza
The real value of PEGs, btw, is in designing your own languages; using a PEG ensures that the language is unambiguously parsable, rather than the traditional approach to language design which is to either not care about parsing (and come up with abominable syntaxes like C and C++) or to design a syntax and then beat on it until it eventually becomes something that your tool (traditionally yacc) can actually parse. By making the fundamental operation parsing (rather than sentence-generation), PEGs make this aspect of language design much easier.
DrPizza
(adding semantic actions that update a table of typenames does, however, cause a problem if you want to memoize your results, as is done in Packrat parsing)
DrPizza