ansaurus

Question

Why C++ cannot be parsed with a LR(1) parser?

Answer 1

+31 A:

There is an interesting thread on Lambda the Ultimate that discusses the LALR grammar for C++.

It includes a link to a PhD thesis that includes a discussion of C++ parsing, which states that:

"C++ grammar is ambiguous, context-dependent and potentially requires infinite lookahead to resolve some ambiguities".

It goes on to give a number of examples (see page 147 of the pdf).

Rob Walker 2008-10-28 14:01:47

It'd be cool to have some summary about the page 147 on this page. I'm going to read that page though. (+1)

Cheery 2008-10-28 14:11:30

Answer 2

+5 A:

I think you are pretty close to the answer.

LR(1) means:

Parsing from left to right needs only one token to look-ahead for the context,

whereas LR(infinity) means an infinite look-ahead. That is, the parser would have to know everything that was coming in order to figure out where it is now.

casademora 2008-10-28 14:05:08

I recall from my compilers class that LR(n) for n > 0 is mathematically reducable to LR(1). Is that not true for n = infinity?

rmeador 2008-10-28 14:41:06

No, there's an impassable mountain of a difference between n and infinity.

ephemient 2008-10-28 14:53:28

Isn't the answer: Yes, given an infinite amount of time? :)

Steve Fallows 2008-10-28 15:56:25

Actually, by my *vague* recollection of how LR(n) -> LR(1) takes place, it involves creating new intermediate states, so the runtime is some non-constant function of 'n'. Translating LR(inf) -> LR(1) would take infinite time.

Aaron 2008-12-05 21:07:44

"Isn't the answer: Yes, given an infinite amount of time?" -- No: the phrase 'given an infinite amount of time' is just a non-sensical, short-hand way of saying "cannot be done given any finite amount of time". When you see "infinite", think: "not any finite".

ChrisW 2009-06-17 02:10:45

Answer 3

+41 A:

LR parsers can't handle ambiguous grammar rules, by design. (Made the theory easier back in the 1970s when the ideas were being worked out).

C and C++ both allow the following statement:

x * y ;

It has two different parses:

It can be the declaration of y, as pointer to type x
It can be a multiply of x and y, throwing away the answer.

Now, you might think the latter is stupid and should be ignored. Most would agree with you; however, there are cases where it might have a side effect (e.g., if multiply is overloaded). but that isn't the point. The point is there are two different parses.

The compiler must accept the appropriate one under the appropriate circumstances, and in the absence of any other information (e.g., knowledge of the type of x) must collect both in order to decide later what to do. Thus a grammar must allow this. And that makes the grammer ambiguous.

Thus LR can't handle this.

There are lots of more complicated cases, but only takes one to shoot down pure LR parsing.

Most real C/C++ parsers handle this by using some kind of deterministic parser intertwined with symbol table collection... so that by the time "x" is encountered, the parser knows if x is a type or not, and can thus choose between the two potential parses. But a parser that does this isn't context free, and LR parsers (the pure ones) are context free.

One can cheat, and add checks in the reduction proposal to LR parsers to do this disambiguation.

And if you cheat enough, you can make LR parsers work for C and C++. The GCC guys did for awhile, but gave it up for hand-coded parsing, I think because they wanted better error diagnostics.

There's another approach, though, which is nice and clean and parses C and C++ just fine without any symbol table hackery: GLR parsers. These are full context free parsers (having effectively infinite lookahead). GLR parsers simply accept both parses, producing a "tree" (actually a directed acyclic graph that is mostly tree like) that represents the ambiguous parse. A post-parsing pass can resolve the ambiguities.

We use this technique in the C and C++ front ends for the DMS Software Reengineering Tookit. They have been used to process millions of lines of large C and C++ systems, as well as dozens of other languages.

Ira Baxter 2009-06-17 02:01:10

While the 'x * y' example is interesting, the same can happen in C ('y' can be a typedef or a variable). But C can be parsed by a LR(1) parser, so what's the difference with C++?

Martin Cote 2009-06-17 02:16:21

My answser had already observed that C had the same problem, I think you missed that. No, it can't be parsed by LR(1), for the same reason. Er, what do you mean 'y' can be a typedef? Perhaps you meant 'x'? That doesn't change anything.

Ira Baxter 2009-06-17 06:06:36

Parse 2 is not necessarily stupid in C++, as * could be overridden to have side effects.

Dour High Arch 2009-09-09 16:15:41

In the provided example, it doesn't have any side effects. You are right, there's variations on this theme (say, overloaded "*") which could have side effects. As stated, people *might* think it was stupid. I've adjusted the text.

Ira Baxter 2009-09-09 17:52:20

I would count overloading a stateless operator to include side effects as a dumb use (flame bait but I'll elaborate). I always had that problem with operator overloading: none of the properties of an operator that would make it a reasonable analogy to the original use are guaranteed. For example, the comparison operators < > <= >= absolutely don't have to work as a partial order.

altie 2010-08-12 19:32:12

@altie: agreed, when you have overloading you want something like the Liskov substitution principle, which says the key properties of the overloaded operator are preserved. Alas, none of the OO langauges I know about (except Liskov's CLU) insist on that. So, you can object that the overload in C++ is dumb, but it is legal.

Ira Baxter 2010-08-12 20:04:59

Answer 4

+1 A:

As you can see in my answer here, C++ contains syntax that cannot be deterministically parsed by an LL or LR parser due to the type resolution stage (typically post-parsing) changing the order of operations, and therefore the fundamental shape of the AST (typically expected to be provided by a first-stage parse).

280Z28 2009-09-02 02:05:41

Parsing technology that handles ambiguity simply produces *both* AST variants as they parse, and simply eliminate the incorrect one depending on the type information.

Ira Baxter 2009-09-02 16:51:39

@Ira: Yes, that is correct. The particular advantage of that is it allows you maintain the separation of the first-stage parse. While it's most commonly known in the GLR parser, there's no particular reason I see that you couldn't hit C++ with a "GLL?" parser as well.

280Z28 2009-09-02 17:50:01

@280Z28: "GLL"? Well, sure, but you'll have to go figure out the theory and write up a paper for the rest of to use. More likely, you could use a top down hand coded parser, or a backtracking LALR() parser (but keep the "rejected") parses, or run an Earley parser. GLR has the advantage of being a pretty damn good solution, is well documented and by now well proved. A GLL technology would have to have some pretty significant advantages to display GLR.

Ira Baxter 2010-01-28 18:28:14

@280Z28: The Rascal project (Netherlands) is claiming they are building a scannerless GLL parser. Work in progress, might be hard to find any online information. http://en.wikipedia.org/wiki/RascalMPL

Ira Baxter 2010-08-21 21:46:24

ansaurus

tags:

views:

answers:

Why C++ cannot be parsed with a LR(1) parser?

related questions