I'm trying to match measurements in English input text, using ANTLR 3.2 and Java 1.6. I've got lexer rules like the following:
    fragment
    MILLIMETRE
        : 'millimetre' | 'millimetres'
        | 'millimeter' | 'millimeters'
        | 'mm'
        ;

    MEASUREMENT
        : MILLIMETRE | CENTIMETRE | ... ;
I'd like to accept any combination of upper- and lowercase input and, more importantly, get back a single token type for all the variants of MILLIMETRE. At the moment, though, my AST contains 'millimetre', 'millimeters', 'mm' etc., exactly as they appear in the input text.
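To make the goal concrete, here is a rough plain-Java sketch of the normalisation I'm after. It is not ANTLR code and not a solution to the lexer problem; the class name `UnitNormalizer` and the method `canonicalType` are made up purely for illustration, and the string "T_MILLIMETRE" just stands in for the token type I want to see in the AST.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not part of ANTLR): maps every spelling variant,
// in any mix of upper- and lowercase, to one canonical token name.
final class UnitNormalizer {
    private static final Map<String, String> CANONICAL = new HashMap<String, String>();

    static {
        String[] millimetreVariants = {
            "millimetre", "millimetres", "millimeter", "millimeters", "mm"
        };
        for (String variant : millimetreVariants) {
            CANONICAL.put(variant, "T_MILLIMETRE");
        }
    }

    private UnitNormalizer() {
    }

    // Returns the canonical token name, or null if the text is not a known unit.
    static String canonicalType(String text) {
        return CANONICAL.get(text.toLowerCase(Locale.ROOT));
    }
}
```

So `UnitNormalizer.canonicalType("MilliMeters")` and `UnitNormalizer.canonicalType("MM")` would both give "T_MILLIMETRE", which is the behaviour I'd like from the lexer itself.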
After reading http://www.antlr.org/wiki/pages/viewpage.action?pageId=1802308, I think I need to do something like the following:
    tokens {
        T_MILLIMETRE;
    }

    fragment
    MILLIMETRE
        : ('millimetre' | 'millimetres'
        | 'millimeter' | 'millimeters'
        | 'mm') { $type = T_MILLIMETRE; }
        ;
However, when I do this, the Java code generated by ANTLR fails to compile with the following error:
    cannot find symbol
    _type = T_MILLIMETRE;
I tried the following instead:
    MEASUREMENT
        : MILLIMETRE { $type = T_MILLIMETRE; }
        | ...
but then MEASUREMENT is not matched anymore.
The more obvious solution with a rewrite rule:
    MEASUREMENT
        : MILLIMETRE -> ^(T_MILLIMETRE MILLIMETRE)
        | ...
causes an NPE inside the ANTLR tool itself:
    java.lang.NullPointerException at org.antlr.grammar.v2.DefineGrammarItemsWalker.alternative(DefineGrammarItemsWalker.java:1555)
Making MEASUREMENT into a parser rule gives me the dreaded "The following token definitions can never be matched because prior tokens match the same input" error.
If I instead create a parser rule
    measurement : T_MILLIMETRE | ...
I get the warning "no lexer rule corresponding to token: T_MILLIMETRE". ANTLR still runs, but the AST still contains the input text rather than T_MILLIMETRE.
I'm obviously not yet seeing the world the way ANTLR does. Can anyone give me any hints or advice, please?
Steve