I'm trying to match measurements in English input text, using Antlr 3.2 and Java 1.6. I've got lexical rules like the following:

fragment
MILLIMETRE
    :   'millimetre' | 'millimetres'
    |   'millimeter' | 'millimeters'
    |   'mm'
    ;

MEASUREMENT
    :   MILLIMETRE | CENTIMETRE | ... ;

I'd like to be able to accept any combination of upper- and lowercase input and - more importantly - just return a single lexical token for all the variants of MILLIMETRE. But at the moment, my AST contains 'millimetre', 'millimeters', 'mm' etc. just as in the input text.

After reading http://www.antlr.org/wiki/pages/viewpage.action?pageId=1802308, I think I need to do something like the following:

tokens {
    T_MILLIMETRE;
}

fragment
MILLIMETRE
    :   ('millimetre' | 'millimetres'
    |   'millimeter' | 'millimeters'
    |   'mm') { $type = T_MILLIMETRE; }
    ;

However, when I do this, I get the following compiler error in the Java code generated by Antlr:

cannot find symbol
_type = T_MILLIMETRE;

I tried the following instead:

MEASUREMENT
    :   MILLIMETRE  { $type = T_MILLIMETRE; }
    |   ...

but then MEASUREMENT is not matched anymore.

The more obvious solution with a rewrite rule:

MEASUREMENT
    :   MILLIMETRE  -> ^(T_MILLIMETRE MILLIMETRE)
    |   ...

causes an NPE:

java.lang.NullPointerException at org.antlr.grammar.v2.DefineGrammarItemsWalker.alternative(DefineGrammarItemsWalker.java:1555).

Making MEASUREMENT into a parser rule gives me the dreaded "The following token definitions can never be matched because prior tokens match the same input" error.

By creating a parser rule

measurement :  T_MILLIMETRE | ...

I get the warning "no lexer rule corresponding to token: T_MILLIMETRE". Antlr still runs, but the AST still contains the input text rather than T_MILLIMETRE.

I'm obviously not yet seeing the world the way Antlr does. Can anyone give me any hints or advice please?

Steve

A: 

Here's a way to do that:

grammar Measurement;

options {
  output=AST;
}

tokens {
  ROOT;
  MM;
  CM;
}

parse
  :  measurement+ EOF -> ^(ROOT measurement+)
  ;

measurement
  :  Number MilliMeter -> ^(MM Number)
  |  Number CentiMeter -> ^(CM Number)
  ;

Number
  :  '0'..'9'+
  ;

MilliMeter
  :  'millimetre'
  |  'millimetres'
  |  'millimeter'
  |  'millimeters'
  |  'mm'
  ;

CentiMeter
  :  'centimetre'
  |  'centimetres'
  |  'centimeter'
  |  'centimeters'
  |  'cm'
  ;

Space
  :  (' ' | '\t' | '\r' | '\n'){$channel=HIDDEN;}
  ;

It can be tested with the following class:

import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;

public class Main {
    public static void main(String[] args) throws Exception {
        ANTLRStringStream in = new ANTLRStringStream("12 millimeters 3 mm 456 cm");
        MeasurementLexer lexer = new MeasurementLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        MeasurementParser parser = new MeasurementParser(tokens);
        MeasurementParser.parse_return returnValue = parser.parse();
        CommonTree tree = (CommonTree)returnValue.getTree();
        DOTTreeGenerator gen = new DOTTreeGenerator();
        StringTemplate st = gen.toDOT(tree);
        System.out.println(st);
    }
}

which produces the following DOT file:

digraph {

    ordering=out;
    ranksep=.4;
    bgcolor="lightgrey"; node [shape=box, fixedsize=false, fontsize=12, fontname="Helvetica-bold", fontcolor="blue"
        width=.25, height=.25, color="black", fillcolor="white", style="filled, solid, bold"];
    edge [arrowsize=.5, color="black", style="bold"]

  n0 [label="ROOT"];
  n1 [label="MM"];
  n1 [label="MM"];
  n2 [label="12"];
  n3 [label="MM"];
  n3 [label="MM"];
  n4 [label="3"];
  n5 [label="CM"];
  n5 [label="CM"];
  n6 [label="456"];

  n0 -> n1 // "ROOT" -> "MM"
  n1 -> n2 // "MM" -> "12"
  n0 -> n3 // "ROOT" -> "MM"
  n3 -> n4 // "MM" -> "3"
  n0 -> n5 // "ROOT" -> "CM"
  n5 -> n6 // "CM" -> "456"

}

which corresponds to the tree:

[image of the tree, created with http://graph.gafol.net/: ROOT with children MM (12), MM (3) and CM (456)]

EDIT

Note that the following:

measurement
  :  Number m=MilliMeter {System.out.println($m.getType() == MeasurementParser.MilliMeter);}
  |  Number CentiMeter
  ;

will always print true, regardless of whether the contents of the millimeter token are mm, millimetre, millimetres, ...

Bart Kiers
Thanks for your response, Bart. I was aware of this possibility. The difference is that I'm trying to solve the problem at the lexical level, whereas you propose a syntactic rule. Your way is presumably the correct Antlr way. My experience with this problem is that rewrite rules only work with syntactic rules, and not with lexical rules. I'm solving the problem in my solution at the moment by post-processing the results in my Java code, but I should perhaps reconsider what I do in Antlr at the lexical level and what I do at the syntactic level.
Stephen Winnall
@Stephen, ah okay, I see what you mean. But in my example, the type (for millimeter) will always be `MilliMeter` (see my **EDIT**). So I'm not entirely sure what you're after.
Bart Kiers
You made me think, Bart. I was approaching the problem the wrong way. I was trying to do effectively bottom-up recognition by making the lexical analysis context-sensitive. This meant that I quickly reached the limits of what Antlr could do, since it is a top-down tool. I've shifted a lot of the analysis into the syntax now (like in your example), and everything's becoming easier. I think one has to be very aware of the difference between lexical rules and syntactical rules in Antlr, even if they look very similar. Not everything that syntactic rules can do is possible with lexical ones.
Stephen Winnall
@Stephen, yeah, very true. It can be quite tricky to decide what to put in the lexer and what in the parser, especially when the language becomes more complex. Best of luck!
Bart Kiers
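For anyone who still wants the post-processing route Stephen mentions in the comments above, here is a minimal sketch in plain Java (no ANTLR classes involved). It maps every spelling of a unit to one canonical name and ignores case, which also covers the "any combination of upper- and lowercase" part of the question. The class UnitNormalizer and its normalize method are hypothetical names made up for this illustration, and the canonical names "MM" and "CM" simply mirror the imaginary tokens used in the grammar above.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of ANTLR: maps each spelling of a unit
// to a single canonical name, ignoring upper/lower case.
class UnitNormalizer {
    private static final Map<String, String> CANONICAL = new HashMap<String, String>();

    static {
        // Spellings taken from the lexer rules in the question and answer.
        for (String s : new String[] {"millimetre", "millimetres", "millimeter", "millimeters", "mm"}) {
            CANONICAL.put(s, "MM");
        }
        for (String s : new String[] {"centimetre", "centimetres", "centimeter", "centimeters", "cm"}) {
            CANONICAL.put(s, "CM");
        }
    }

    // Returns "MM" or "CM", or null if the text is not a known unit spelling.
    static String normalize(String tokenText) {
        return CANONICAL.get(tokenText.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(normalize("Millimetres")); // prints MM
        System.out.println(normalize("CM"));          // prints CM
    }
}
```

You would call something like normalize(token.getText()) on the unit tokens after parsing. That said, once the distinction is made in the parser rules as in the answer above, this kind of lookup is only needed for the case folding.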
A: 

Note that fragment rules only "live" inside the lexer and cease to exist in the parser. For example:

grammar Measurement;

options {
  output=AST;
}

parse
  :  (m=MEASUREMENT {
       String contents = $m.text;
       boolean isMeasurementType = $m.getType() == MeasurementParser.MEASUREMENT;
       System.out.println("contents="+contents+", isMeasurementType="+isMeasurementType);
     })+ EOF
  ;

MEASUREMENT
  :  MILLIMETRE
  ;

fragment
MILLIMETRE
  :  'millimetre' 
  |  'millimetres'
  |  'millimeter' 
  |  'millimeters'
  |  'mm'
  ;

SPACE
  :  (' ' | '\t' | '\r' | '\n'){$channel=HIDDEN;}
  ;

with input text:

"millimeters mm"

will print:

contents=millimeters, isMeasurementType=true
contents=mm, isMeasurementType=true

In other words, the token type MILLIMETRE does not exist in the parser; both tokens are of type MEASUREMENT.

Bart Kiers