views:

2100

answers:

4

I were reading about parsers and parser generators when I hit upon this statement in wikipedia's LR parsing -page:

"Many programming languages can be parsed using some variation of an LR parser. One notable exception is C++."

Why is it so? What particular property in C++ causes it to be impossible to parse with LR parsers?

I first tried google and only found out that C can be perfectly parsed with LR(1) but C++ requires LR(∞). But then I think this is a good question. I understand that it must be very interesting reason why this is so.

(PS. wikipedia's parser articles make much more sense if you have already written couple of parsers yourself with parser generators or perhaps by hand, before that you do not really understand much about them. It's interesting how you need to use the tool to learn about how to build good one, but then.. so obvious)

+31  A: 

There is an interesting thread on Lambda the Ultimate that discusses the LALR grammar for C++.

It includes a link to a PhD thesis that includes a discussion of C++ parsing, which states that:

"C++ grammar is ambiguous, context-dependent and potentially requires infinite lookahead to resolve some ambiguities".

It goes on to give a number of examples (see page 147 of the pdf).

Rob Walker
It'd be cool to have some summary about the page 147 on this page. I'm going to read that page though. (+1)
Cheery
+5  A: 

I think you are pretty close to the answer.

LR(1) means:

Parsing from left to right needs only one token to look-ahead for the context,

whereas LR(infinity) means an infinite look-ahead. That is, the parser would have to know everything that was coming in order to figure out where it is now.

casademora
I recall from my compilers class that LR(n) for n > 0 is mathematically reducable to LR(1). Is that not true for n = infinity?
rmeador
No, there's an impassable mountain of a difference between n and infinity.
ephemient
Isn't the answer: Yes, given an infinite amount of time? :)
Steve Fallows
Actually, by my *vague* recollection of how LR(n) -> LR(1) takes place, it involves creating new intermediate states, so the runtime is some non-constant function of 'n'. Translating LR(inf) -> LR(1) would take infinite time.
Aaron
"Isn't the answer: Yes, given an infinite amount of time?" -- No: the phrase 'given an infinite amount of time' is just a non-sensical, short-hand way of saying "cannot be done given any finite amount of time". When you see "infinite", think: "not any finite".
ChrisW
+41  A: 

LR parsers can't handle ambiguous grammar rules, by design. (Made the theory easier back in the 1970s when the ideas were being worked out).

C and C++ both allow the following statement:

x * y ;

It has two different parses:

  1. It can be the declaration of y, as pointer to type x
  2. It can be a multiply of x and y, throwing away the answer.

Now, you might think the latter is stupid and should be ignored. Most would agree with you; however, there are cases where it might have a side effect (e.g., if multiply is overloaded). but that isn't the point. The point is there are two different parses.

The compiler must accept the appropriate one under the appropriate circumstances, and in the absence of any other information (e.g., knowledge of the type of x) must collect both in order to decide later what to do. Thus a grammar must allow this. And that makes the grammer ambiguous.

Thus LR can't handle this.

There are lots of more complicated cases, but only takes one to shoot down pure LR parsing.

Most real C/C++ parsers handle this by using some kind of deterministic parser intertwined with symbol table collection... so that by the time "x" is encountered, the parser knows if x is a type or not, and can thus choose between the two potential parses. But a parser that does this isn't context free, and LR parsers (the pure ones) are context free.

One can cheat, and add checks in the reduction proposal to LR parsers to do this disambiguation.

And if you cheat enough, you can make LR parsers work for C and C++. The GCC guys did for awhile, but gave it up for hand-coded parsing, I think because they wanted better error diagnostics.

There's another approach, though, which is nice and clean and parses C and C++ just fine without any symbol table hackery: GLR parsers. These are full context free parsers (having effectively infinite lookahead). GLR parsers simply accept both parses, producing a "tree" (actually a directed acyclic graph that is mostly tree like) that represents the ambiguous parse. A post-parsing pass can resolve the ambiguities.

We use this technique in the C and C++ front ends for the DMS Software Reengineering Tookit. They have been used to process millions of lines of large C and C++ systems, as well as dozens of other languages.

Ira Baxter
While the 'x * y' example is interesting, the same can happen in C ('y' can be a typedef or a variable). But C can be parsed by a LR(1) parser, so what's the difference with C++?
Martin Cote
My answser had already observed that C had the same problem, I think you missed that. No, it can't be parsed by LR(1), for the same reason. Er, what do you mean 'y' can be a typedef? Perhaps you meant 'x'? That doesn't change anything.
Ira Baxter
Parse 2 is not necessarily stupid in C++, as * could be overridden to have side effects.
Dour High Arch
In the provided example, it doesn't have any side effects. You are right, there's variations on this theme (say, overloaded "*") which could have side effects. As stated, people *might* think it was stupid. I've adjusted the text.
Ira Baxter
I would count overloading a stateless operator to include side effects as a dumb use (flame bait but I'll elaborate). I always had that problem with operator overloading: none of the properties of an operator that would make it a reasonable analogy to the original use are guaranteed. For example, the comparison operators < > <= >= absolutely don't have to work as a partial order.
altie
@altie: agreed, when you have overloading you want something like the Liskov substitution principle, which says the key properties of the overloaded operator are preserved. Alas, none of the OO langauges I know about (except Liskov's CLU) insist on that. So, you can object that the overload in C++ is dumb, but it is legal.
Ira Baxter
+1  A: 

As you can see in my answer here, C++ contains syntax that cannot be deterministically parsed by an LL or LR parser due to the type resolution stage (typically post-parsing) changing the order of operations, and therefore the fundamental shape of the AST (typically expected to be provided by a first-stage parse).

280Z28
Parsing technology that handles ambiguity simply produces *both* AST variants as they parse, and simply eliminate the incorrect one depending on the type information.
Ira Baxter
@Ira: Yes, that is correct. The particular advantage of that is it allows you maintain the separation of the first-stage parse. While it's most commonly known in the GLR parser, there's no particular reason I see that you couldn't hit C++ with a "GLL?" parser as well.
280Z28
@280Z28: "GLL"? Well, sure, but you'll have to go figure out the theory and write up a paper for the rest of to use. More likely, you could use a top down hand coded parser, or a backtracking LALR() parser (but keep the "rejected") parses, or run an Earley parser. GLR has the advantage of being a pretty damn good solution, is well documented and by now well proved. A GLL technology would have to have some pretty significant advantages to display GLR.
Ira Baxter
@280Z28: The Rascal project (Netherlands) is claiming they are building a scannerless GLL parser. Work in progress, might be hard to find any online information. http://en.wikipedia.org/wiki/RascalMPL
Ira Baxter