views:

552

answers:

4

I am teaching myself to use JavaCC in a hobby project, and have a simple grammar to write a parser for. Part of the parser includes the following:

TOKEN : { < DIGIT : (["0"-"9"]) > }
TOKEN : { < INTEGER : (<DIGIT>)+ > }
TOKEN : { < INTEGER_PAIR : (<INTEGER>){2} > }
TOKEN : { < FLOAT : (<NEGATE>)? <INTEGER> | (<NEGATE>)? <INTEGER>  "." <INTEGER>  | (<NEGATE>)? <INTEGER> "." | (<NEGATE>)? "." <INTEGER> > } 
TOKEN : { < FLOAT_PAIR : (<FLOAT>){2} > }
TOKEN : { < NUMBER_PAIR : <FLOAT_PAIR> | <INTEGER_PAIR> > }
TOKEN : { < NEGATE : "-" > }

When compiling with JavaCC I get the output:

Warning: Regular Expression choice : FLOAT_PAIR can never be matched as : NUMBER_PAIR

Warning: Regular Expression choice : INTEGER_PAIR can never be matched as : NUMBER_PAIR

I'm sure this is a simple concept but I don't understand the warning, being a novice in both parser generation and regular expressions.

What does this warning mean (in as-novice-as-you-can-get terms)?

A: 

I haven't used JavaCC, but it is possible that NUMBER_PAIR is ambiguous.

I think the problem comes down to the fact that the same exact thing can be matched as both FLOAT_PAIR and INTEGER_PAIR since FLOAT can match an INTEGER.

But this is just a guess having never seen the JavaCC syntax :)

Uri
I'm not sure about that, I changed Float so it couldn't match Integer - { < FLOAT : (<NEGATE>)? <INTEGER> "." <INTEGER> | (<NEGATE>)? <INTEGER> "." | (<NEGATE>)? "." <INTEGER> > }, and still received the warning. I'm surprised by that, cause what you said seemed perfectly logical :)
Grundlefleck
Hmm... I still think it's ambiguity-related but honestly, since I haven't tried JavaCC I am of no real use to you here... I'll defer and hope somebody who knows it will answer.
Uri
A: 

It probably means that for every FLOAT_PAIR you'll just get a FLOAT_PAIR token, never a NUMBER_PAIR token. The FLOAT_PAIR rule already matches all the input and JavaCC will not try to find further matching rules. That would be my interpretation, but I don't know JavaCC, so take it with a grain of salt.

Maybe you can specify somehow that NUMBER_PAIR is the main production and that you don't want to get any other tokens as results.

sth
+2  A: 

I don't know JavaCC, but I am a compiler engineer.

The FLOAT_PAIR rule is ambiguous. Consider the following text:

0.0

This could be FLOAT 0 followed by FLOAT .0; or it could be FLOAT 0. followed by FLOAT 0; both resulting in FLOAT_PAIR. Or it could be a single FLOAT 0.0.

More importantly, though, you are using lexical analysis with composition in a way that is never likely to work. Consider this number:

12345

This could be parsed as INTEGER 12, INTEGER 345 resulting in an INTEGER_PAIR. Or it could be parsed as INTEGER 123, INTEGER 45, another INTEGER_PAIR. Or it could be INTEGER 12345, another token. The problem exists because you are not requiring white space between the lexical elements of the INTEGER_PAIR (or FLOAT_PAIR).

You should almost never try to handle pairs like this in the lexer. Instead, you should handle plain numbers (INTEGER and FLOAT) as tokens, and handle things like negation and pairing in the parser, where whitespace has been dealt with and stripped.

(For example, how are you going to process "----42"? This is a valid expression in most programming languages, which will correctly calculate multiple negations, but would not be handled by your lexer.)

Also, be aware that single-digit integers in your lexer will not be matched as INTEGER, they will come out as DIGIT. I don't know the correct syntax for JavaCC to fix that for you, though. What you want is to define DIGIT not as a token, but simply something you can use in the definitions of other tokens; alternatively, embed the definition of DIGIT ([0-9]) directly wherever you are using DIGIT in your rules.

Barry Kelly
A: 

Thanks to Barry Kelly's answer, the solution I've come up with is:

    SKIP : { < #TO_SKIP : " " | "\t" > }
    TOKEN : { < #DIGIT : (["0"-"9"]) > }
    TOKEN : { < #DIGITS : (<DIGIT>)+ > }
    TOKEN : { < INTEGER : <DIGITS> > }
    TOKEN : { < INTEGER_PAIR : (<INTEGER>) (<TO_SKIP>)+ (<INTEGER>) > }
    TOKEN : { < FLOAT : (<NEGATE>)?<DIGITS>"."<DIGITS> | (<NEGATE>)?"."<DIGITS> > } 
    TOKEN : { < FLOAT_PAIR : (<FLOAT>) (<TO_SKIP>)+ (<FLOAT>) > }
    TOKEN : { < #NUMBER : <FLOAT> | <INTEGER> > }
    TOKEN : { < NUMBER_PAIR : (<NUMBER>) (<TO_SKIP>)+ (<NUMBER>) >}
    TOKEN : { < NEGATE : "-" > }

I had completely forgot to include the space which is used to separate the two tokens, I've also used the '#' symbol which stops the tokens being matched, and is just used in the definition of other tokens. The above is compiled by JavaCC without warning or error.

However, as noted by Barry, there are reasons against doing this.

Grundlefleck