Hello, everyone!

I need to make JavaCC aware of a context (current parent token), and depending on that context, expect different token(s) to occur.

Consider the following pseudo-code:

TOKEN <abc> { "abc*" } // recognizes "abc", "abcd", "abcde", ...
TOKEN <abcd> { "abcd*" } // recognizes "abcd", "abcde", "abcdef", ...

TOKEN <element1> { "element1" "[" expectOnly(<abc>) "]" }
TOKEN <element2> { "element2" "[" expectOnly(<abcd>) "]" }
...

So when the generated parser is "inside" a token named "element1" and it encounters "abcdef", it recognizes it as <abc>, but when it is "inside" a token named "element2", it recognizes the same string as <abcd>.

element1 [ abcdef ] // aha! it can only be <abc>
element2 [ abcdef ] // aha! it can only be <abcd>

If I'm not wrong, this would behave similarly to the more complex DTD definitions of an XML file.

So, how can one specify which token(s) are valid/expected in which "context"?

NOTE: It would not be enough for my real case to define a kind of "hierarchy" of tokens, so that "abcdef" is always matched first against <abcd> and then against <abc>. I really need context-aware tokens.

A: 

OK, it seems that you need a technique called lookahead here. Here is a very good tutorial: Lookahead tutorial

My first attempt was wrong then, but as it works for distinct tokens which define a context I'll leave it here (Maybe it's useful for somebody ;o)).


Let's say we want to have some kind of markup language. All we want to "markup" are:

  • Expressions consisting of letters (abc...zABC...Z) and whitespace --> words
  • Expressions consisting of numbers (0-9) --> numbers

We want to enclose words in <WORDS> tags and numbers in <NUMBER> tags. So if I understood you correctly, this is what you want to do: in the word context (between the <WORDS> tags) the parser should expect letters and whitespace; in the number context (between the <NUMBER> tags) it should expect digits.

I created the file WordNumber.jj which defines the grammar and the parser to be generated:

options
{
    LOOKAHEAD= 1;

    CHOICE_AMBIGUITY_CHECK = 2;
    OTHER_AMBIGUITY_CHECK = 1;
    STATIC = true;
    DEBUG_PARSER = false;
    DEBUG_LOOKAHEAD = false;
    DEBUG_TOKEN_MANAGER = false;
    ERROR_REPORTING = true;
    JAVA_UNICODE_ESCAPE = false;
    UNICODE_INPUT = false;
    IGNORE_CASE = false;
    USER_TOKEN_MANAGER = false;
    USER_CHAR_STREAM = false;
    BUILD_PARSER = true;
    BUILD_TOKEN_MANAGER = true;
    SANITY_CHECK = true;
    FORCE_LA_CHECK = false;
}

PARSER_BEGIN(WordNumberParser)

/** Model-tree Parser */
public class WordNumberParser
{
    /** Main entry point. */
    public static void main(String args []) throws ParseException
    {
        WordNumberParser parser = new WordNumberParser(System.in);
        parser.Input();
    }
}

PARSER_END(WordNumberParser)

SKIP :
{
    " "
|   "\n"
|   "\r"
|   "\r\n"
|   "\t"
}

TOKEN :
{
    < WORD_TOKEN : (["a"-"z"] | ["A"-"Z"] | " " | "." | ",")+ > |
    < NUMBER_TOKEN : (["0"-"9"])+ >
}


/** Root production. */
void Input() :
{}
{
    ( WordContext() | NumberContext() )* < EOF >
}

/** WordContext production. */
void WordContext() :
{}
{
    "<WORDS>" (< WORD_TOKEN >)+ "</WORDS>"
}

/** NumberContext production. */
void NumberContext() :
{}
{
    "<NUMBER>" (< NUMBER_TOKEN >)+ "</NUMBER>"
}

You can test it with a file like this:

<WORDS>This is a sentence. As you can see the parser accepts it.</WORDS>
<WORDS>The answer to life, universe and everything is</WORDS><NUMBER>42</NUMBER>
<NUMBER>This sentence will make the parser sad. Do not make the parser sad.</NUMBER>

The last line will cause the parser to throw an exception like this:

Exception in thread "main" ParseException: Encountered " <WORD_TOKEN> "This sentence will make the parser sad. Do not make the parser sad. "" at line 3, column 9. Was expecting: <NUMBER_TOKEN> ...

That is because, after "<NUMBER>", the parser expects a NUMBER_TOKEN but the token manager delivered a WORD_TOKEN instead.

I hope that helps.

Cheers!

P.S.: The parser can't "be" inside a token, as a token is a terminal symbol (correct me if I'm wrong) that can't be expanded any further by production rules. So all the context aspects have to be placed inside a production rule (a nonterminal) like "WordContext" in my example.
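
To make the P.S. a bit more concrete for the element1/element2 case from the question, here is a minimal hand-written Java sketch. It is not JavaCC output and not a drop-in solution: the class name ContextTokens, the method tokenize and the "KIND(text)" result strings are all made up for illustration, and I am assuming that "abc*" in the pseudo-code means "abc followed by anything", as the comments in the question suggest. The only point is that the enclosing element, not the text itself, decides which token kind the text becomes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hand-written illustration only; this is NOT code generated by JavaCC. */
final class ContextTokens {

    // elementN [ body ]: group 1 is the context name, group 2 is the body text.
    private static final Pattern ELEMENT =
        Pattern.compile("(element1|element2)\\s*\\[\\s*(\\w+)\\s*\\]");

    /** Classifies each body by its enclosing element; returns strings like "ABC(abcdef)". */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = ELEMENT.matcher(input);
        while (m.find()) {
            String context = m.group(1);
            String body = m.group(2);
            // The context alone selects the expected token kind (hypothetical names).
            boolean inElement1 = context.equals("element1");
            String kind = inElement1 ? "ABC" : "ABCD";
            String requiredPrefix = inElement1 ? "abc" : "abcd";
            if (!body.startsWith(requiredPrefix)) {
                throw new IllegalArgumentException(
                    "Inside " + context + " expected <" + kind + "> but found \"" + body + "\"");
            }
            tokens.add(kind + "(" + body + ")");
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The same text "abcdef" is classified differently in each context.
        for (String token : tokenize("element1 [ abcdef ] element2 [ abcdef ]")) {
            System.out.println(token);
        }
    }
}
```

Running main prints ABC(abcdef) followed by ABCD(abcdef). As far as I know, JavaCC's lexical states let the token manager switch its active set of tokens in a comparable way, so that may be worth looking at for the real grammar; the sketch above does not use them.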

das_weezul
Thank you very much for the example, but it is not what my problem is. There is no problem at all when the tokens used are distinguishable, as in your case (one is enclosed in `<NUMBER>...</NUMBER>`, the other in `<WORDS>...</WORDS>`). In contrast, in my case I have tokens which would *both* match certain input.
java.is.for.desktop
@java.is.for.desktop: Ah OK, sorry. I think you can use "lookahead" then. Check my edited post for a link ;o)
das_weezul