views:

304

answers:

3

I have a grammar like this:

"Match one or more rule1 where rule1 is one or more rule2, where rule2 is one or more rule3, etc. etc. each seperated by newlines". Look at the following example.

start:   rule1_list
      ;

rule1_list:   rule1
           |  rule1_list NEWLINE rule1
            ;

rule1:   rule2
     |   rule2 NEWLINE rule3_list
      ;

rule2:   TERMINAL2
      ;

rule3_list:   rule3
          |   rule3_list NEWLINE rule3
          ;

rule3 :  TERMINAL3
      ;

I get shift/reduce conflicts doing this, how can I change the grammar to stop? Essentially it needs to branch after a new line and look if the next thing is a TERMINAL2 or TERMINAL3.

+3  A: 

Ambiguous grammar, not LALR(1), unparsable by default yacc mode

To make a long story short, you can "fix" this with a %glr-parser declaration as follows:

%glr-parser
%%
start: rule1_list
. . .
. . .


To make a long story kind of medium-length...

Shift-reduce conflicts are normally not errors. The conflict is resolved by always doing the shift which is usually what you want. Most or all real-world grammars have shift-reduce conflicts. And if you wanted the reduction you can arrange for that with precedence declarations.

However, in a truely ambiguous grammar, doing the shift will send the parser down one of two paths, only one of which will ultimately find a string in the grammar. In this case the S/R conflict is a fatal error.

Analyzing the first one, when the parser sees the newline in the | rule2 NEWLINE rule3_list case, it can either shift to a new state where it will be expected a rule3_list, or it can reduce a rule1 using rule1: rule2. It will always look for a rule3_list due to the default choice of shift.

The second conflict occurs when it sees a newline in rule3_list: rule3_list . NEWLINE rule3. Now it can either shift and start looking for rule3 or reduce a rule1 using | rule2 NEWLINE rule3_list.

The result is that as written, and assuming '2' and '3' for the terminals, you can only parse a 2 line followed by 3 lines. If you fiddle with the precedence you can only parse '2' lines and never '3' lines.

Finally, I should add that going with a yacc-generated GLR parser is something of a kludge. I imagine it will work just fine but it's pure BFI, the parser splits, keeps two stacks, continues down both paths until one finds a string in the grammar. Sadly, the other fixes are kludges too: 1. reformulate the grammar as LALR(1), 2. add extra lookahead in the scanner and return a composite token, 3. Experiment with the rules for the grammar you have, perhaps yacc can handle a variation.

This is why I don't actually like yacc and prefer hand-written recursive descent or something more modern like PEG. (See Treetop.)

I tried something with (the preferred) left-recursive rules that simply ignored the newlines (which complicate your grammar, making whitespace tokens...) .. and this "works", although I'm not sure if it's what you want...

%%
start:   stmtList
      ;

stmtList: /* nothing */ 
      | stmtList '2' threeList;
      ;

threeList: /* nothing */
      | threeList '3'
      ;
%%
int yylex() { int c; do {  c = getchar (); } while (c == '\n'); return c; }
DigitalRoss
Oh, idea, if you just skipped newlines in the scanner that might give you enough lookahead. By returning white space as a token you have unnecessarily increased the amount of look-ahead required by the parser.
DigitalRoss
Interesting, but then how will test that the user has not inputted newlines all over the place?
DevDevDev
You can always push logic down into the scanner where you have full control and can write any C code you want. You could error out tokens not followed by a single newline by returning an error token from yylex(). Or, if newlines are white space, then why not just allow an arbitrary number of them? That's what compilers and assemblers do...
DigitalRoss
Hey I'm still getting the error, "Expecting TERMINAL3 got TERMINAL2"
DevDevDev
R U sure you reran both yacc and cc? Here is my test case for your original grammar, I just ran it now. I know it works: http://pastie.org/706454
DigitalRoss
I guess "works" is kind of relative. It will take a string of 2's and 3's as you seem to have intended but I don't think it recognizes a complete string beginning at start so it always ends the input with a syntax error. The difference is that all previous versions would not reduce the '2' list, where as this one properly recognizes '2' lists and '3' lists.
DigitalRoss
BTW, the grammar I stuck on pastie.org *does* correctly handle end-of-input and reducing back to the start rule without syntax errors at end-of-input. I should also add that you didn't provide any test inputs, so my guess on what the grammar *should* have taken was based on what it *actually* took, so I could have got that way wrong...
DigitalRoss
A: 

I think you have to convert the left recursions to right recursions. An example for rule3_list:

rule3_list: TERMINAL3 | TERMINAL3 NEWLINE rule3_list;
Joy Dutta
A: 

not ambiguous, just not LALR(1)

The problem is that several places in the grammar require 2-token lookeahead, to see which TERMINAL comes after a NEWLINE in order to decide what to do. There are a number of things you could do to fix this.

  1. skip newlines in the scaaner -- then they would no longer be tokens and wouldn't get in the way of the lookahead

  2. Use %glr-parser. This can be dangerous if you ever do introduce amibiguities in your grammar, as they would require merge functions to make things work. There's no good way to determine if any given conflict is due to ambiguity or just needing more lookahead -- you need to carefully analyze each conflict bison reports to tell.

  3. refactor the grammar to defer decisions, so not as much lookahead is required. One easy choice is to absorb the newlines into the rules as terminators instead of separators:

    start:   rule1_list ;
    
    
    rule1_list:   rule1
              |  rule1_list rule1
              ;
    
    
    rule1:   rule2
         |   rule2 rule3_list
         ;
    
    
    rule2:   TERMINAL2 NEWLINE ;
    
    
    rule3_list:   rule3
              |   rule3_list rule3
              ;
    
    
    rule3 :  TERMINAL3 NEWLINE ;
    

of course, this changes the grammar, in that a newline is now required after the last rule before the EOF

Chris Dodd