I have a relatively simple lex/flex file and have been running it with flex's debug flag to make sure it's tokenizing properly. Unfortunately, I keep running into one of two problems: either the program that flex generates silently gives up after a couple of tokens, or the rule I'm using to recognize characters and strings isn't matched and the default rule is used instead.

Can someone point me in the right direction? I've attached my flex file and sample input / output.

Edit: I've found that the generated lexer stops after a specific rule: "cdr". This is more specific, but also much more confusing. I've posted a shortened, modified lex file.

/* lex file*/
%option noyywrap
%option nodefault

%{
enum tokens {
    CDR,
    CHARACTER,
    SET
};
%}

%%

"cdr"                                               { return CDR; }
"set"                                               { return SET; }

[ \t\r\n]                                           /*Nothing*/
[a-zA-Z0-9\\!@#$%^&*()\-_+=~`:;"'?<>,\.]      { return CHARACTER; }

%%

Sample input:

set c cdra + cdr b + () ;

Complete output from running the input through the generated lexer:

--(end of buffer or a NUL)
--accepting rule at line 16 ("set")
--accepting rule at line 18 (" ")
--accepting rule at line 19 ("c")
--accepting rule at line 18 (" ")
--accepting rule at line 15 ("cdr")

Any thoughts? The generated program is giving up after half of the input! (For reference, I'm supplying the input by redirecting the contents of a file to the generated program.)

A: 

This rule

[-+]?([0-9*\.?[0-9]+|[0-9]+\.)([Ee][-+]?[0-9]+)? 
          |

seems to be missing a closing bracket just after the first 0-9. I added a | below the spot where I think it should go. I couldn't begin to guess how flex would respond to that.

The character class I usually use for symbol names is [a-zA-Z$_]. This is like your unquoted strings, except that I usually allow digits inside a symbol as long as the symbol doesn't start with one.

[a-zA-Z$_]([a-zA-Z$_]|[0-9])*

A character is just a short symbol. I don't think it needs its own rule, but if it does, then you need to ensure that the string rule requires at least 2 characters.

[a-zA-Z$_]([a-zA-Z$_]|[0-9])+
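
To make the difference between the two patterns concrete, here is a minimal hand-written sketch in C of what each one accepts. The function names (is_symbol, is_long_symbol and the two helpers) are hypothetical and only for illustration; flex generates its own matcher from the pattern and does not work this way internally. The sketch also assumes an ASCII-compatible character set.

```c
#include <stdbool.h>

/* Matches the class [a-zA-Z$_] (assumes ASCII-style contiguous letters). */
static bool is_symbol_start(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           c == '$' || c == '_';
}

/* Matches [a-zA-Z$_] or [0-9]. */
static bool is_symbol_char(char c)
{
    return is_symbol_start(c) || (c >= '0' && c <= '9');
}

/* True when the whole string matches [a-zA-Z$_]([a-zA-Z$_]|[0-9])* */
static bool is_symbol(const char *s)
{
    if (!is_symbol_start(*s))
        return false;               /* empty string or bad first character */
    for (s++; *s != '\0'; s++) {
        if (!is_symbol_char(*s))
            return false;
    }
    return true;
}

/* True when the whole string matches the + variant: two or more characters. */
static bool is_long_symbol(const char *s)
{
    return is_symbol(s) && s[1] != '\0';
}
```

With these definitions, "c" counts as a symbol but not as a long symbol, "cdra" counts as both, and "9lives" counts as neither because it starts with a digit.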
John Knoeller
Fixed the mismatched square brackets, but no luck. However, I did manage to duplicate the issue with a shorter set of rules.
Zxaos
+1  A: 

When generating a standalone lexer (that is, one whose tokens are not defined by bison/yacc), you typically write an enum at the top of the file defining your tokens. However, the main loop of a lex program, including the main loop generated by default, looks something like this:

while ((token = yylex())) {
    ...
}

This is fine until your lexer returns the token that appears first in the enum, in this specific case CDR. Since enums start at zero by default, yylex() returns 0, which is also the value it returns at end of input, so the while loop ends. Renumbering your enum will solve the issue.

enum tokens {
    CDR = 1,
    CHARACTER,
    SET
};

Short version: when defining tokens by hand for a lexer, start at 1, not 0.
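
The effect can be reproduced without flex at all. The following is a minimal sketch in C, not the generated scanner: fake_yylex, struct fake_lexer, count_tokens, the Z_ prefixed enum and the two stream arrays are hypothetical names made up for illustration. The stand-in lexer hands out token codes from an array and then returns 0, which is what a flex scanner returns at end of input. The arrays hold the token sequence for the sample input "set c cdra + cdr b + () ;" under each numbering.

```c
#include <stddef.h>

/* Numbering as in the question: CDR is implicitly 0. */
enum tokens_from_zero { Z_CDR, Z_CHARACTER, Z_SET };

/* Numbering after the fix: 0 is left free to mean "end of input". */
enum tokens_from_one { CDR = 1, CHARACTER, SET };

#define STREAM_LEN 11

/* Tokens for: set c cdr a + cdr b + ( ) ;   (whitespace is skipped) */
static const int zero_based_stream[STREAM_LEN] = {
    Z_SET, Z_CHARACTER, Z_CDR, Z_CHARACTER, Z_CHARACTER, Z_CDR,
    Z_CHARACTER, Z_CHARACTER, Z_CHARACTER, Z_CHARACTER, Z_CHARACTER
};
static const int one_based_stream[STREAM_LEN] = {
    SET, CHARACTER, CDR, CHARACTER, CHARACTER, CDR,
    CHARACTER, CHARACTER, CHARACTER, CHARACTER, CHARACTER
};

/* Hypothetical stand-in for yylex(): replays codes, then returns 0. */
struct fake_lexer {
    const int *codes;
    size_t len;
    size_t pos;
};

static int fake_yylex(struct fake_lexer *lx)
{
    if (lx->pos >= lx->len)
        return 0;                   /* end of input */
    return lx->codes[lx->pos++];
}

/* Same loop shape as above: counts tokens until the first 0. */
static size_t count_tokens(const int *codes, size_t len)
{
    struct fake_lexer lx = { codes, len, 0 };
    size_t seen = 0;
    int token;

    while ((token = fake_yylex(&lx)) != 0)
        seen++;
    return seen;
}
```

Here count_tokens(zero_based_stream, STREAM_LEN) gives 2, because the loop sees "set" and "c" and then mistakes the first CDR for end of input, which matches where the debug trace in the question stops. count_tokens(one_based_stream, STREAM_LEN) gives 11, the whole input.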

Zxaos