OK, so here is the deal.

In my language I have some commands, say

XYZ 3 5
GGB 8 9
HDH 8783 33

And in my Lex file

XYZ { return XYZ; }
GGB { return GGB; }
HDH { return HDH; }
[0-9]+ { yylval.ival = atoi(yytext); return NUMBER; }
\n  { return EOL; }

In my yacc file

start : commands
      ;

commands : command
         | command EOL commands
         ;

command : xyz
        | ggb
        | hdh
        ;

xyz : XYZ NUMBER NUMBER { /* Do something with the numbers */ }
    ;

etc. etc. etc. etc.

My question is, how can I get the entire text

XYZ 3 5
GGB 8 9
HDH 8783 33

Into commands while still returning the NUMBERs?

Also, when my Lex file returns a STRING token ([0-9a-zA-Z]+) and I want to verify its length, should I do it like

rule: STRING STRING { if (strlen($1) < 5) { /* do something */ } else { /* report an error */ } }

or actually have a token in my Lex that returns different tokens depending on length?

+1  A: 

If I've understood your first question correctly, you can have semantic actions like

{ $$ = makeXYZ($2, $3); }

which will allow you to build the value of command as you want.
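A minimal sketch of what such a helper might look like in C. The answer only names makeXYZ; the struct command type, its fields, and the command_kind enum below are assumptions made for illustration, not something the answer defines.

```c
#include <stdlib.h>

/* Hypothetical node type: one parsed command with its two numbers. */
enum command_kind { CMD_XYZ, CMD_GGB, CMD_HDH };

struct command {
    enum command_kind kind;
    int first;
    int second;
};

/* Allocate and fill a command node; returns NULL if malloc fails. */
struct command *make_command(enum command_kind kind, int first, int second)
{
    struct command *cmd = malloc(sizeof *cmd);
    if (cmd == NULL)
        return NULL;
    cmd->kind = kind;
    cmd->first = first;
    cmd->second = second;
    return cmd;
}

/* The helper named in the action above, built on make_command. */
struct command *makeXYZ(int first, int second)
{
    return make_command(CMD_XYZ, first, second);
}
```

For $$ to carry this pointer, the grammar would also need a matching member in %union (say, struct command *cmd) and a declaration such as %type <cmd> xyz; both names are again illustrative.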

For your second question, the borders between lexical analysis and grammatical analysis, and between grammatical analysis and semantic analysis, aren't hard and fixed. Moving them is a trade-off between factors like ease of description, clarity of error messages and robustness in the presence of errors. Considering the verification of string length, the likelihood of an error occurring is quite high, and the error message will probably be unclear if the check is handled by returning different terminals for different lengths. So if it is possible -- that depends on the grammar -- I'd handle it in the semantic analysis phase, where the message can easily be tailored.
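As a rough sketch of a check done in the semantic phase, a small C helper can test the length and build a tailored message. The name check_length, its signature, and the message wording are assumptions for illustration; the limit of 5 used in the note after the block comes from the question.

```c
#include <stdio.h>
#include <string.h>

/* Returns 0 when strlen(text) < limit. Otherwise writes a tailored
   message into err (NUL-terminated when err_size > 0) and returns -1. */
int check_length(const char *text, size_t limit, char *err, size_t err_size)
{
    size_t len = strlen(text);
    if (len < limit)
        return 0;
    snprintf(err, err_size,
             "string \"%s\" is %zu characters long; expected fewer than %zu",
             text, len, limit);
    return -1;
}
```

A rule's action could then call check_length($1, 5, msg, sizeof msg) and pass msg to yyerror when the result is non-zero, so the user sees which string was too long rather than a generic syntax error.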

AProgrammer
A: 

Since you use yylval.ival, you already have a union with an ival field in your YACC source, like this:

%union {
    int ival;
}

Now you specify the token's type, like this:

%token <ival> NUMBER

So now you can access the ival field of a NUMBER token in your rules simply by its position (here $2 and $3), like

xyz : XYZ NUMBER NUMBER { printf("XYZ %d %d", $2, $3); }

For your second question, I'd define the union like this:

%union {
    char*   strval;
    int     ival;
}

and in your YACC source specify the token types

%token <strval> STRING
%token <ival> NUMBER

So now you can do things like

foo : STRING NUMBER { printf("%s (len %zu) %d", $1, strlen($1), $2); }
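One detail the rule above relies on: the lexer has to hand the parser its own copy of the token text, because yytext is reused when the next token is scanned. A minimal sketch of such a copy helper follows; the name copy_token_text is hypothetical, and on POSIX systems strdup does the same job.

```c
#include <stdlib.h>
#include <string.h>

/* Return a heap-allocated copy of text, or NULL if malloc fails.
   The caller owns the result and must free() it. */
char *copy_token_text(const char *text)
{
    size_t size = strlen(text) + 1;   /* include the terminating NUL */
    char *copy = malloc(size);
    if (copy == NULL)
        return NULL;
    memcpy(copy, text, size);
    return copy;
}
```

The STRING rule in the Lex file would then do something like yylval.strval = copy_token_text(yytext); before returning STRING, and the parser action would free the string once it is done with it.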
qrdl
+1  A: 

If you arrange for your lexical analyzer (yylex()) to store the whole string in some variable, then your code can access it. The communication with the parser proper will be through the normal mechanisms, but there's nothing that says you can't also have another variable lurking around (probably a file static variable - but beware multithreading) that stores the whole input line before it is dissected.
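A minimal sketch of that idea, assuming a fixed-size file-static buffer that the lexer appends to from every rule (including whitespace). The function names and the 1024-byte limit are illustrative assumptions, and, as noted, the static buffer is not thread-safe.

```c
#include <string.h>

#define LINE_BUF_SIZE 1024   /* assumed maximum line length */

static char line_buf[LINE_BUF_SIZE];
static size_t line_len = 0;

/* Append one token's text (for example yytext) to the current line.
   Text that does not fit is silently truncated. */
void line_append(const char *text)
{
    size_t n = strlen(text);
    size_t room = LINE_BUF_SIZE - 1 - line_len;
    if (n > room)
        n = room;
    memcpy(line_buf + line_len, text, n);
    line_len += n;
    line_buf[line_len] = '\0';
}

/* Text collected since the last reset. */
const char *line_text(void)
{
    return line_buf;
}

/* Start a new line; call this once the parser has used the text. */
void line_reset(void)
{
    line_len = 0;
    line_buf[0] = '\0';
}
```

Each Lex action would call line_append(yytext) before returning its token, and a parser action would read line_text() and then call line_reset(). One caveat: the parser may already have read a lookahead token when an action runs, so the buffer can contain that token's text as well.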

Jonathan Leffler
I guess maybe I wasn't that clear, but I have something like

HEADER
BLOCK1
STUFF
BLOCK1END
BLOCK2
STUFF
BLOCK2END

where BLOCK1 and BLOCK2 match the same rule. What I need is the entire text of BLOCK1 at the time that I match BLOCK1. The simplest, but annoying, way is to give every rule a string value, i.e. from my example

%type <sval> xyz

Then for every rule I would have to put

xyz: XYZ NUMBER NUMBER { $$ = "XYZ" + $2 + $3; }

which over time can get very annoying.
DevDevDev
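The "XYZ" + $2 + $3 in the comment above is pseudocode, since C has no string concatenation operator. If one did go the string-valued route, a single shared helper could keep each action to one call. This is only a sketch: format_command is a hypothetical name, and it rebuilds the text from the parsed values (normalising whitespace) rather than preserving the original input.

```c
#include <stdio.h>
#include <stdlib.h>

/* Build "NAME first second" in a freshly allocated string.
   Returns NULL on failure; the caller must free() the result. */
char *format_command(const char *name, int first, int second)
{
    /* First pass: ask snprintf how many characters are needed. */
    int needed = snprintf(NULL, 0, "%s %d %d", name, first, second);
    if (needed < 0)
        return NULL;

    char *text = malloc((size_t)needed + 1);
    if (text == NULL)
        return NULL;

    /* Second pass: write the formatted text into the buffer. */
    snprintf(text, (size_t)needed + 1, "%s %d %d", name, first, second);
    return text;
}
```

An action would then read { $$ = format_command("XYZ", $2, $3); }, assuming the sval member of the union is a char *.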