ansaurus

Question

JavaCC: How can I specify which token(s) are expected in certain context?

Answer 1

A:

OK, it seems that you need a technique called lookahead here. Here is a very good tutorial: Lookahead tutorial

My first attempt was wrong then, but as it works for distinct tokens which define a context I'll leave it here (Maybe it's useful for somebody ;o)).

Let's say we want to have some kind of markup language. All we want to "markup" are:

Expressions consisting of letters (abc...zABC...Z) and whitespaces --> words
Expressions consisting of numbers (0-9) --> numbers

We want to enclose words in tags and numbers in tags. So if i got you right that is what you want to do: If you're in the word context (between word tags) the compiler should expect letters and whitespaces, in the number context it expects numbers.

I created the file WordNumber.jj which defines the grammar and the parser to be generated:

options
{
    LOOKAHEAD= 1;

    CHOICE_AMBIGUITY_CHECK = 2;
    OTHER_AMBIGUITY_CHECK = 1;
    STATIC = true;
    DEBUG_PARSER = false;
    DEBUG_LOOKAHEAD = false;
    DEBUG_TOKEN_MANAGER = false;
    ERROR_REPORTING = true;
    JAVA_UNICODE_ESCAPE = false;
    UNICODE_INPUT = false;
    IGNORE_CASE = false;
    USER_TOKEN_MANAGER = false;
    USER_CHAR_STREAM = false;
    BUILD_PARSER = true;
    BUILD_TOKEN_MANAGER = true;
    SANITY_CHECK = true;
    FORCE_LA_CHECK = false;
}

PARSER_BEGIN(WordNumberParser)

/** Model-tree Parser */
public class WordNumberParser
{
    /** Main entry point. */
    public static void main(String args []) throws ParseException
    {
        WordNumberParser parser = new WordNumberParser(System.in);
        parser.Input();
    }
}

PARSER_END(WordNumberParser)

SKIP :
{
    " "
|   "\n"
|   "\r"
|   "\r\n"
|   "\t"
}

TOKEN :
{
    < WORD_TOKEN : (["a"-"z"] | ["A"-"Z"] | " " | "." | ",")+ > |
    < NUMBER_TOKEN : (["0"-"9"])+ >
}


/** Root production. */
void Input() :
{}
{
    ( WordContext() | NumberContext() )* < EOF >
}

/** WordContext production. */
void WordContext() :
{}
{
    "<WORDS>" (< WORD_TOKEN >)+ "</WORDS>"
}

/** NumberContext production. */
void NumberContext() :
{}
{
    "<NUMBER>" (< NUMBER_TOKEN >)+ "</NUMBER>"
}

You can test it with a file like that:

<WORDS>This is a sentence. As you can see the parser accepts it.</WORDS>
<WORDS>The answer to life, universe and everything is</WORDS><NUMBER>42</NUMBER>
<NUMBER>This sentence will make the parser sad. Do not make the parser sad.</NUMBER>

The Last line will cause the parser to throw an exception like this:

Exception in thread "main" ParseException: Encountered " <WORD_TOKEN> "This sentence will make the parser sad. Do not make the parser sad. "" at line 3, column 9. Was expecting: <NUMBER_TOKEN> ...

That is because the parser did not find what it expected.

I hope that helps.

Cheers!

P.S.: The parser can't "be" inside a token as a token is a terminal symbol (correct me if I'm wrong) which can't be replaced by production rules any further. So all the context aspects have to be placed inside a production rule (non terminal) like "WordContext" in my example.

das_weezul 2010-05-05 11:47:55

Thank you very much for the example, but it is not what my problem is. There is no problem at all when used tokens are distinguishable, as in your case (the one is enclosed in `<NUMBER>...</NUMBER>`, the other is enclosed in `<WORD>...</WORD>`). In contrast to that, in my case, I have tokens which would *both* match certain input.

java.is.for.desktop 2010-05-05 18:11:46

@java.is.for.desktop:Ah OK, sorry. I think you can use "lookahead" then. Check my edited post for a link ;o)

das_weezul 2010-05-05 20:49:33

ansaurus

tags:

views:

answers:

JavaCC: How can I specify which token(s) are expected in certain context?

related questions