ansaurus

Question

Strange problem with context free grammar

Answer 1

A:

So I spent a reasonable amount of time picking over the grammar (and the bison output) and can't see what is obviously wrong here. Without having the means to execute it, I can't easily figure out what is going on by experimentation. Therefore, here are some concrete steps I usually go through when debugging grammars. Hopefully you can do any of these you haven't already done and then perhaps post follow-ups (or edit your question) with any results that might be revealing:

Contrive the smallest (in terms of number of tokens) possible working input, and the smallest possible non-working inputs based on the rules you expect to be applied.
Create a copy of the grammar file including only the troublesome rules and as few other supporting rules as you can get away with (i.e. you want a language that only allows construction of sequences consisting of the object and more_object rules, joined by ARROW. Does this work as you expect?
Does the rule in which it is nested work as you expect? Try replacing object with some other very simple rule (using some tokens not occuring elsewhere) and seeing if you can include those tokens without it breaking everything else.
Run bison with --report=all. Inspect the output to try to trace the rules you've added and the states that they affect. Try removing those rules and repeat the process - what has changed? This is extremely time consuming often, and is a giant pain, but it's a good last resort. I recommend a pencil and some paper.

Looking at the structure of your error output - '+' is being recognised as an identifier token, and is therefore being looked up as a symbol. It might be worth checker your lexer to see how it is processing identifier tokens. You might just accidentally be grabbing too much. As a further debugging technique, you might consider turning some of those token literals (e.g. '+', '{', etc) into real tokens so that bison's error reporting can help you out a little more.

EDIT: OK, the more I've dug into it, the more I'm convinced that the lexer is not necessarily working as it should be. I would double-check that the stream of tokens you are getting from yylex() matches your expectations before proceeding any further. In particular, it looks like a bunch of symbols that you consider special (e.g. '+' and '{') are being captured by some of your regular expressions, or at least are being allowed to pass for identifiers.

Gian 2010-08-06 03:02:25

I'll certainly try some of these (thanks for the --report=all comment; didn't know about that) but I have done a little research already. One I think telling caveat of this bug is that if I change `object: object_name ARROW more_objects;` to `object: '$' object_name ARROW more_objects;` then everything works fine. In other words, when the FIRST of `object` isn't `IDENTIFIER` there is no problem. I feel like it provides valuable insight even if I don't know what yet.

Chris 2010-08-06 14:19:44

If you substitute object_name for id_lookup, do you start getting shift/reduce conflicts?

Gian 2010-08-06 14:58:07

No change in behavior when I swap `object_name` for `id_lookup`

Chris 2010-08-06 15:52:22

Hang on a second... you say you've looked at "the bison output" - what do you mean? I think it's worth clarifying that the error output ("Attempting to use undefined symbol '+' on line 3.") is coming from my parser when I run the script, not from Bison processing the .y file.

Chris 2010-08-07 18:29:48

Yes, I guessed that. That output is generated when the id_lookup production rule fails, correct? This means that the symbol "+" is ending up being looked up by id_lookup. I also took the linked grammar file and ran it through bison with the various reporting levels turned on. I couldn't see anything obviously wrong in that output. Have you manually verified that the tokens coming out of your lexer are correct as a first step? Do a little wrapper main() that just calls yylex() and spits out the tokens. If that output is actually correct, then perhaps you could post that output for us?

Gian 2010-08-08 15:09:02

@Chris: I second Gian's suggestion. Verify your lexer produces the tokens you think it should, before you decide the parser is ill. The bit about "+" being diagnosed by id_lookup is pretty suspicious.

Ira Baxter 2010-08-15 05:07:10

Answer 2

+2 A:

It seems you are doing a classical error when using direct strings in the yacc source file. Since you are using a lexer, you can only use token names in yacc source files. More on this here

jpalecek 2010-08-06 23:32:14

Actually this method has been working just fine, but I'd been wondering whether it was poor practice. At any rate, it has no impact on the problem at hand.

Chris 2010-08-07 18:14:04

Answer 3

+1 A:

I think your principal problem is that you failed to define a subtree constructor in your object subgrammar. (EDIT: OP says he left the semantic actions for object out of his example text. That doesn't change the following answer).

You probably have to lookup up the objects in the order encountered, too. Maybe you intended:

primary_expression 
   : constant_value                                        { $$ = $1; } 
   | '(' expression ')'                                    { $$ = $2; } 
   | list_initialization                                   { $$ = $1; } 
   | function_call                                         { $$ = $1; } 
   | object                                                { $$ = $1; } 
   ; 



object
   : IDENTIFIER    { $$ = LookupVariableOrObject( yytext ); } 
   |  object ARROW IDENTIFIER  { $$ = LookupSubobject( $1, yytext ); } 
   ;

I assume that if one encounters an identifier X by itself, your default interpretation is that it is a variable name. But, if you encounter X -> Y, then even if X is a variable name, you want the object X with subobject Y.

What LookupVarOrObject does is to lookup the leftmost identifier encountered to see if it is variable (and return essentially the same value as idlookup which must produce an AST node of type AST_VAR), or see if it is valid object name, and return an AST node marked as an AST_OBJ, or complain if the identifier isn't one of these.

What LookupSuboject does, is to check its left operand to ensure it is an AST_OBJ (or an AST_VAR whose name happens to be the same as that of an object). and complain if it is not. If it is, then its looks up the yytext-child object of the named AST_OBJ.

EDIT: Based on discussion comments in another answer, right-recursion in the OP's original grammar might be problematic if the OP's semantic checks inspect global lexer state (yytext). This solution is left-recursive and won't run afoul of that particular trap.

Ira Baxter 2010-08-13 05:37:38

Answer 4

+1 A:

You don't get shift/reduce conflicts because your rules using object_name and more_objects are right-recursive - rather than the left-recursive rules that Yacc (Bison) handles most naturally.

On classic Yacc, you would find that you can run out of stack space with deep enough nesting of the 'object->name->what->not' notation. Bison extends its stack at runtime, so you have to run out of memory, which is a lot harder these days than it was when machines had a few megabytes of memory (or less).

One result of the right-recursion is that no reductions occur until you read the last of the object names in the chain (or, more accurately, one symbol beyond that). I see that you've used right-recursion with your statement_list rule too - and in a number of other places too.

Jonathan Leffler 2010-08-13 06:47:05

You'll run out of stack space only if it right recurses a huge number of times, unless your stack is rediculously small. No human being will code a->b->...->z for a very long chain, so I just don't see how that can be a practical objection.

Ira Baxter 2010-08-15 03:22:32

Right reductions shouldn't prevent him from getting a working result. I personally prefer left recursive lists because they "feel" better and they are a tiny bit more efficient, but that's just personal bias and not a technical issue.

Ira Baxter 2010-08-15 03:24:10

@Ira: agreed - right reduction doesn't prevent the grammar working. In classic Yacc, you could run out of right-recursive stack relatively quickly (150 or so), but I agree it is not a practical objection. What is a factor, though, is that the lexer has to read much further before anything can be reduced, and the symptoms of the trouble could be related to that - if the symbols are not saved correctly as they are read. Yacc deals in LALR(1) grammars - Look Ahead Left-Right grammars with 1 token of look-ahead.

Jonathan Leffler 2010-08-15 03:47:08

@Jonathan: His tokens aren't ambiguous, so it doesn't matter how far ahead the lexer gets as long as every lexeme carries its complete state. However, if he expects to test the state of yytext on an early token, and yytext is a global variable overwritten by the most recent collected token, *then* he could have problem with a right recursive rule, I'll agree. If that's his problem, though, the disease is caused by each lexeme not carrying its entire state, e.g., he's being hurt by his lexer/parser generator. That would be a real shame and he'd still have to fix it.

Ira Baxter 2010-08-15 04:17:22

Answer 5

A:

id_lookup : IDENTIFIER

is formally identical to

object_name : IDENTIFIER

and object_name would accept everything that id_lookup wouldn't, so assertLookup( yytext ); probably runs on everything that may look like IDENTIFIER and is not accepted by enother rule just to decide between the 2 and then object_name can't accept because single lookahead forbids that.

For the twilight zone, the two chars that you got errors for are not declared as tokens with opends the zone of undefinded behavior and could trip parser into trying to treat them as potential identifiers when the grammar gets loose.

ZXX 2010-08-14 01:33:47

Answer 6

+1 A:

I just tried running muscl in Ubuntu 10.04 using bison 2.4.1 and I was able to run both of your examples with no syntax errors. My guess is that you have a bug in your version of bison. Let me know if I'm somehow running your parser wrong. Below is the output from the first example you gave.

./muscle < ./test1.m (this was your first test)

\-statement list
  |-declaration (constant)
  | |-symbol reference
  | | \-list (constant)
  | \-list
  |   |-value
  |   | \-1
  |   |-value
  |   | \-2
  |   |-value
  |   | \-3
  |   \-value
  |     \-4
  |-loop (for-in)
  | |-symbol reference
  | | \-x (variable)
  | |-symbol reference
  | | \-list (constant)
  | \-statement list
  |   \-send statement
  |     \-binary op (addition)
  |       |-binary op (addition)
  |       | |-value
  |       | | \-foo 
  |       | \-symbol reference
  |       |   \-x (variable)
  |       \-value
  |         \-!
  \-send statement
    \-value
      \-Done!

 +-----+----------+-----------------------+-----------------------+
 |   1 | VALUE    | 1                     |                       |
 |   2 | ELMT     | @1                    |                       |
 |   3 | VALUE    | 2                     |                       |
 |   4 | ELMT     | @3                    |                       |
 |   5 | VALUE    | 3                     |                       |
 |   6 | ELMT     | @5                    |                       |
 |   7 | VALUE    | 4                     |                       |
 |   8 | ELMT     | @7                    |                       |
 |   9 | LIST     |                       |                       |
 |  10 | CONST    | @10                   | @9                    |
 |  11 | ITER_NEW | @11                   | @10                   |
 |  12 | BRA      | @14                   |                       |
 |  13 | ITER_INC | @11                   |                       |
 |  14 | ITER_END | @11                   |                       |
 |  15 | BRT      | @22                   |                       |
 |  16 | VALUE    | foo                   |                       |
 |  17 | ADD      | @16                   | @11                   |
 |  18 | VALUE    | !                     |                       |
 |  19 | ADD      | @17                   | @18                   |
 |  20 | SEND     | @19                   |                       |
 |  21 | BRA      | @13                   |                       |
 |  22 | VALUE    | Done!                 |                       |
 |  23 | SEND     | @22                   |                       |
 |  24 | HALT     |                       |                       |
 +-----+----------+-----------------------+-----------------------+
foo 1!
foo 2!
foo 3!
foo 4!
Done!

Lolindrath 2010-08-14 04:05:30

ansaurus

tags:

views:

answers:

Strange problem with context free grammar

related questions