I'm attempting to write a parser in JavaCC that can recognize a language that has some ambiguity at the token level. In this particular case the language supports the "/" token by itself as a division operator while it also supports regular expression literals.

Consider the following JavaCC grammar:

TOKEN : 
{
    ...
    < VAR : "var" > |
    < DIV : "/" > |
    < EQUALS : "=" > |
    < SEMICOLON : ";" > |
    ...
}

TOKEN :
{
    < IDENTIFIER : <IDENTIFIER_START> (<IDENTIFIER_START> | <IDENTIFIER_CHAR>)* > |
    < #IDENTIFIER_START : ( [ "$","_","A"-"Z","a"-"z" ] )> |
    < #IDENTIFIER_CHAR : ( [ "$","_","A"-"Z","a"-"z","0"-"9" ] ) >  |

    < REGEX_LITERAL : ("/" <REGEX_BODY> "/" ( <REGEX_FLAGS> )? ) > |
    < #REGEX_BODY : ( <REGEX_FIRST_CHAR> <REGEX_CHARS> ) > |
    < #REGEX_CHARS : ( <REGEX_CHAR> )* > |
    < #REGEX_FIRST_CHAR : ( ~["\r", "\n", "*", "/", "\\"] | <BACKSLASH_SEQUENCE> ) > |
    < #REGEX_CHAR : ( ~[ "\r", "\n", "/", "\\" ] | <BACKSLASH_SEQUENCE> ) > |
    < #BACKSLASH_SEQUENCE : ("\\" ~[ "\r", "\n"] ) > |
    < #REGEX_FLAGS : ( <IDENTIFIER_CHAR> )* >

}

Given the following code:

var y = a/b/c;

Two different token streams could be generated. The result could be either:

<VAR> <IDENTIFIER> <EQUALS> <IDENTIFIER> <DIV> <IDENTIFIER> <DIV> <IDENTIFIER> <SEMICOLON>

or

<VAR> <IDENTIFIER> <EQUALS> <IDENTIFIER> <REGEX_LITERAL> <SEMICOLON>

How can I ensure that the TokenManager generates the token stream that I expect for this case?

A: 

As far as I remember (I worked with JavaCC some time back), the order in which you write the rules is the order in which they are tried, so write your rules in an order that always generates the token stream you want.

nicko
A: 

Since JavaScript/ECMAScript does the same thing (that is, it contains regex literals and a divide operator that look just like those in your example), you might want to look for an existing JavaCC grammar to learn from. I found one linked to from this blog entry; there may be others.

Laurence Gonsalves
+1  A: 

JavaCC will always consume the longest token that matches in the current lexical state, and there is no way to configure it otherwise. The only way to accomplish this is to add a lexical state, say IGNORE_REGEX, that excludes the problematic token, in this case <REGEX_LITERAL>. Then, whenever a token is recognized that cannot be followed by a <REGEX_LITERAL>, the lexical state must be switched to IGNORE_REGEX.

With the input:

var y = a/b/c

The following would occur:

  1. <VAR> is consumed, lexical state is set to DEFAULT
  2. <IDENTIFIER> is consumed, lexical state is set to IGNORE_REGEX
  3. <EQUALS> is consumed, lexical state is set to DEFAULT
  4. <IDENTIFIER> is consumed, lexical state is set to IGNORE_REGEX

    At this point there is an ambiguity in the grammar: either a <DIV> or a <REGEX_LITERAL> could be consumed. Since the lexical state is IGNORE_REGEX, and that state does not include <REGEX_LITERAL>, a <DIV> will be consumed.

  5. <DIV> is consumed, lexical state is set to DEFAULT

  6. <IDENTIFIER> is consumed, lexical state is set to IGNORE_REGEX
  7. <DIV> is consumed, lexical state is set to DEFAULT
  8. <IDENTIFIER> is consumed, lexical state is set to IGNORE_REGEX
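
The state switches above could be written roughly as follows. This is only a sketch built from the token definitions in the question: the SKIP rule, the choice of which state each token switches to, and the decision to leave IGNORE_REGEX active after a <REGEX_LITERAL> are my assumptions, and the tokens elided with "..." in the question are left out.

// Whitespace is skipped in both states (assumed; not shown in the question).
<DEFAULT, IGNORE_REGEX> SKIP :
{
    " " | "\t" | "\r" | "\n"
}

// Tokens that can be matched in either state. The ": STATE" suffix names
// the lexical state to enter once the token has been matched.
<DEFAULT, IGNORE_REGEX> TOKEN :
{
    < VAR : "var" > : DEFAULT |
    < DIV : "/" > : DEFAULT |
    < EQUALS : "=" > : DEFAULT |
    < SEMICOLON : ";" > : DEFAULT |

    // An identifier is an operand, so a "/" that follows it must be a <DIV>.
    < IDENTIFIER : <IDENTIFIER_START> (<IDENTIFIER_START> | <IDENTIFIER_CHAR>)* > : IGNORE_REGEX |
    < #IDENTIFIER_START : ( [ "$","_","A"-"Z","a"-"z" ] ) > |
    < #IDENTIFIER_CHAR : ( [ "$","_","A"-"Z","a"-"z","0"-"9" ] ) >
}

// <REGEX_LITERAL> exists only in DEFAULT, so it can never be matched
// while the token manager is in IGNORE_REGEX.
<DEFAULT> TOKEN :
{
    // Assumption: a regex literal is also an operand, so switch to IGNORE_REGEX.
    < REGEX_LITERAL : ("/" <REGEX_BODY> "/" ( <REGEX_FLAGS> )? ) > : IGNORE_REGEX |
    < #REGEX_BODY : ( <REGEX_FIRST_CHAR> <REGEX_CHARS> ) > |
    < #REGEX_CHARS : ( <REGEX_CHAR> )* > |
    < #REGEX_FIRST_CHAR : ( ~["\r", "\n", "*", "/", "\\"] | <BACKSLASH_SEQUENCE> ) > |
    < #REGEX_CHAR : ( ~[ "\r", "\n", "/", "\\" ] | <BACKSLASH_SEQUENCE> ) > |
    < #BACKSLASH_SEQUENCE : ("\\" ~[ "\r", "\n"] ) > |
    < #REGEX_FLAGS : ( <IDENTIFIER_CHAR> )* >
}

With this arrangement the "/" after a is read in IGNORE_REGEX, where only <DIV> can match it. An input such as var y = /b/c; would still produce a <REGEX_LITERAL>, because <EQUALS> switches back to DEFAULT and the longest match wins there.
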
Bryan Kyle