views:

71

answers:

2

I'm trying to learn about shift-reduce parsing. Suppose we have the following grammar, using recursive rules that enforce order of operations, inspired by the ANSI C Yacc grammar:

S: A;

P
    : NUMBER
    | '(' S ')'
    ;

M
    : P
    | M '*' P
    | M '/' P
    ;

A
    : M
    | A '+' M
    | A '-' M
    ;

And we want to parse 1+2 using shift-reduce parsing. First, the 1 is shifted as a NUMBER. My question is, is it then reduced to P, then M, then A, then finally S? How does it know where to stop?

Suppose it does reduce all the way to S, then shifts '+'. We'd now have a stack containing:

S '+'

If we shift '2', the reductions might be:

S '+' NUMBER
S '+' P
S '+' M
S '+' A
S '+' S

Now, on either side of the last line, S could be P, M, A, or NUMBER, and it would still be valid in the sense that any combination would be a correct representation of the text. How does the parser "know" to make it

A '+' M

So that it can reduce the whole expression to A, then S? In other words, how does it know to stop reducing before shifting the next token? Is this a key difficulty in LR parser generation?


Edit: An addition to the question follows.

Now suppose we parse 1+2*3. Some shift/reduce operations are as follows:

Stack    | Input | Operation
---------+-------+----------------------------------------------
         | 1+2*3 | 
NUMBER   | +2*3  | Shift
A        | +2*3  | Reduce (looking ahead, we know to stop at A)
A+       | 2*3   | Shift
A+NUMBER | *3    | Shift (looking ahead, we know to stop at M)
A+M      | *3    | Reduce (looking ahead, we know to stop at M)

Is this correct (granted, it's not fully parsed yet)? Moreover, does lookahead by 1 symbol also tell us not to reduce A+M to A, as doing so would result in an inevitable syntax error after reading *3 ?

+4  A: 

The problem you're describing is an issue with creating LR(0) parsers - that is, bottom-up parsers that don't do any lookahead to symbols beyond the current one they are parsing. The grammar you've described doesn't appear to be an LR(0) grammar, which is why you run into trouble when trying to parse it w/o lookahead. It does appear to be LR(1), however, so by looking 1 symbol ahead in the input you could easily determine whether to shift or reduce. In this case, an LR(1) parser would look ahead when it had the 1 on the stack, see that the next symbol is a +, and realize that it shouldn't reduce past A (since that is the only thing it could reduce to that would still match a rule with + in the second position).

An interesting property of LR grammars is that for any grammar which is LR(k) for k>1, it is possible to construct an LR(1) grammar which is equivalent. However, the same does not extend all the way down to LR(0) - there are many grammars which cannot be converted to LR(0).

See here for more details on LR(k)-ness:

http://en.wikipedia.org/wiki/LR_parser

Amber
If I parse 1+2*3, the stack ends up at A+M at one point, by my understanding. That could be reduced to A, but that would be incorrect here, as it would yield A*..., for which no rule exists. Does looking ahead by 1 symbol indicate that this reduction should not occur as well? I added more detail on this to the original post.
Joey Adams
Yes, it does - because when you have `A+M` on the stack, and you look ahead to `*`, you see that you *must* have an `M` to the left of the `*`, so you know not to reduce if that would result in the top of the stack not being `M`.
Amber
+1  A: 

I'm not exactly sure of the Yacc / Bison parsing algorithm and when it prefers shifting over reducing, however I know that Bison supports LR(1) parsing which means it has a lookahead token. This means that tokens aren't passed to the stack immediately. Rather they wait until no more reductions can happen. Then, if shifting the next token makes sense it applies that operation.

First of all, in your case, if you're evaluating 1 + 2, it will shift 1. It will reduce that token to an A because the '+' lookahead token indicates that its the only valid course. Since there are no more reductions, it will shift the '+' token onto the stack and hold 2 as the lookahead. It will shift the 2 and reduce to an M since A + M produces an A and the expression is complete.

aduric