views:

114

answers:

0

Hello, everyone!

I had already many problems with understanding, how ambiguous tokens can be handled elegantly (or somehow at all) in JavaCC. Let's take this example:

I want to parse XML processing instruction.

The format is: "<?" <target> <data> "?>": target is an XML name, data can be anything except ?>, because it's the closing tag.

So, lets define this in JavaCC:
(I use lexical states, in this case DEFAULT and PROC_INST)

TOKEN : <#NAME : (very-long-definition-from-xml-1.1-goes-here) >
TOKEN : <WSS : (" " | "\t")+ >   // WSS = whitespaces
<DEFAULT> TOKEN : {<PI_START : "<?" > : PROC_INST}
<PROC_INST> TOKEN : {<PI_TARGET : <NAME> >}
<PROC_INST> TOKEN : {<PI_DATA : ~[] >}   // accept everything
<PROC_INST> TOKEN : {<PI_END : "?>" > : DEFAULT}

Now the part which recognizes processing instructions:

void PROC_INSTR() : {} {
(
    <PI_START>
    (t=<PI_TARGET>){System.out.println("target: " + t.image);}
    <WSS>
    (t=<PI_DATA>){System.out.println("data: " + t.image);}
    <PI_END>
) {}
}

Let's test it with <?mytarget here-goes-some-data?>:

The target is recognized: "target: mytarget". But now I get my favorite JavaCC parsing error:

!!  procinstparser.ParseException: Encountered "" at line 1, column 15.
!!  Was expecting one of:
!!      

Encountered nothing? Was expecting nothing? Or what? Thank you, JavaCC!

I know, that I could use the MORE keyword of JavaCC, but this would give me the whole processing instruction as one token, so I'd had to parse/tokenize it further by myself. Why should I do that? Am I writing a parser that does not parse?

The problem is (i guess): hence <PI_DATA> recognizes "everything", my definition is wrong. I should tell JavaCC to recognize "everything except ?>" as processing instruction data.

But how can it be done?

NOTE: I can only exclude single characters using ~["a"|"b"|"c"], I can't exclude strings such as ~["abc"] or ~["?>"]. Another great anti-feature of JavaCC.

Thank you.