I am experimenting with lex and yacc and have run into a strange issue, but I think it would be best to show you my code before detailing the issue. This is my lexer:

%{
#include <stdlib.h>
#include <string.h>
#include "y.tab.h"
void yyerror(char *);
%}

%%

[a-zA-Z]+ {
  yylval.strV = yytext;
  return ID;
}

[0-9]+      {
  yylval.intV = atoi(yytext);
  return INTEGER;
}

[\n] { return *yytext; }

[ \t]        ;

. yyerror("invalid character");

%%

int yywrap(void) {
  return 1;
}

This is my parser:

%{
#include <stdio.h>

int yydebug=1;
void prompt();
void yyerror(char *);
int yylex(void);
%}

%union {
  int intV;
  char *strV;
}

%token INTEGER ID

%%

program: program statement EOF { prompt(); }
       | program EOF { prompt(); }
       | { prompt(); }
       ;

args: /* empty */
    | args ID { printf(":%s ", $<strV>2); }
    ;

statement: ID args { printf("%s", $<strV>1); }
         | INTEGER { printf("%d", $<intV>1); }
;

EOF: '\n'

%%

void yyerror(char *s) {
  fprintf(stderr, "%s\n", s);
}

void prompt() {
  printf("> ");
}

int main(void) {
  yyparse();
  return 0;
}

A very simple language, consisting of no more than strings and integers and a basic REPL. Now, you'll note in the parser that args are output with a leading colon, the intention being that, when combined with the first alternative of the statement rule, the interaction with the REPL would look something like this:

> aaa aa a
:aa :a aaa>

However, the interaction is this:

> aaa aa a
:aa :a aaa aa aa
>

Why does the token ID in the following rule

statement: ID args { printf("%s", $<strV>1); }
         | INTEGER { printf("%d", $<intV>1); }
;

have the semantic value of the total input string, newline included? How can my grammar be reworked so that I get the interaction I intended?

A: 

I think there is an associativity conflict between the args and statement productions. This is borne out by the (partial) output from the bison -v parser.output file:

Nonterminals, with rules where they appear

$accept (6)
    on left: 0
program (7)
    on left: 1 2 3, on right: 0 1 2
statement (8)
    on left: 4 5, on right: 1
args (9)
    on left: 6 7, on right: 4 7
EOF (10)
    on left: 8, on right: 1 2

Indeed, I'm having a hard time trying to figure out what your grammar is trying to accept. As a side note, I'd probably move your EOF production into the lexer as an EOL token; this will make resynchronizing on parse errors easier.

A better explanation of your intent would be helpful.

msw
I'm not really sure how to explain my intent better than the section on interaction does. I'm trying to construct a line-oriented REPL that identifies the first ID as a non-argument, and all the remainder as arguments. The goal is the output of the first interaction, rather than the second.
troutwine
A: 

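You have to copy token strings as they are read if you want them to remain valid: yytext points into the lexer's own input buffer, and the lexer keeps modifying that buffer as it scans later tokens. To show this, I modified the statement rule to read: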

statement: ID { printf("<%s> ", $<strV>1); } args { printf("%s", $<strV>1); }
         | INTEGER { printf("%d", $<intV>1); }
;

Then, with your input, I get the output:

> aaa aa a
<aaa> :aa :a aaa aa a
>

Note that at the time the initial ID is read, the token is exactly what you expected. But, because you did not preserve the token, the string has been modified by the time you get back to printing it after the args have been parsed.

Jonathan Leffler
Thank you, very much.
troutwine