views:

47

answers:

1

I need to make a scanner in lex/flex to find tokens and a parser in yacc/bison to process those tokens based on the following grammar. When I was in the middle of making the scanner, it appeared to me that variables, functions, and arrays in this language can only have the name 'ID'. Am I misreading this yacc file?

/* C-Minus BNF Grammar */

%token ELSE
%token IF
%token INT
%token RETURN
%token VOID
%token WHILE

%token ID
%token NUM

%token LTE
%token GTE
%token EQUAL
%token NOTEQUAL
%%

program : declaration_list ;

declaration_list : declaration_list declaration | declaration ;

declaration : var_declaration | fun_declaration ;

var_declaration : type_specifier ID ';'
                | type_specifier ID '[' NUM ']' ';' ;

type_specifier : INT | VOID ;

fun_declaration : type_specifier ID '(' params ')' compound_stmt ;

params : param_list | VOID ;

param_list : param_list ',' param
           | param ;

param : type_specifier ID | type_specifier ID '[' ']' ;

compound_stmt : '{' local_declarations statement_list '}' ;

local_declarations : local_declarations var_declaration
                   | /* empty */ ;

statement_list : statement_list statement
               | /* empty */ ;

statement : expression_stmt
          | compound_stmt
          | selection_stmt
          | iteration_stmt
          | return_stmt ;

expression_stmt : expression ';'
                | ';' ;

selection_stmt : IF '(' expression ')' statement
               | IF '(' expression ')' statement ELSE statement ;

iteration_stmt : WHILE '(' expression ')' statement ;

return_stmt : RETURN ';' | RETURN expression ';' ;

expression : var '=' expression | simple_expression ;

var : ID | ID '[' expression ']' ;

simple_expression : additive_expression relop additive_expression
                  | additive_expression ;

relop : LTE | '<' | '>' | GTE | EQUAL | NOTEQUAL ;

additive_expression : additive_expression addop term | term ;

addop : '+' | '-' ;

term : term mulop factor | factor ;

mulop : '*' | '/' ;

factor : '(' expression ')' | var | call | NUM ;

call : ID '(' args ')' ;

args : arg_list | /* empty */ ;

arg_list : arg_list ',' expression | expression ;
+1  A: 

ID is just the terminal type returned by the lexer. The idea is that, in the case of variable names (and numbers), other returned information will specify the name (or number). In C-like psuedo-code, the lexer is doing something like:

char *tok = tokenise();
if (!strcmp(tok, "int"))
{
    return INT;
}
else if (is_name(tok))
{
    strcpy(parser.name, tok);
    return ID;
}
else if (is_number(tok))
{
    parser.number = atoi(tok);
    return NUM;
}
...

The parser receives the terminal type (INT, ID, NUM, etc.) and this is sufficient information for it to apply the grammar rules. The actions in the rules can then include the extra information (parser.name, parser.number, etc.) either directly or when constructing the AST.

Edmund
How can you tell that INT is taken literally, but not ID?
Phenom
That is an extremely good question. There is no way to tell from the grammar. I was guessing based on the extreme similarity of your grammar to several others I have seen with their accompanying lexers. Though since your question seems to be "how would this work in the real world (with a realistic lexer)?" I think my assumption is reasonable.
Edmund