ansaurus

Question

Answer 1

A:

Here's a way to do that:

grammar Measurement;

options {
  output=AST;
}

tokens {
  ROOT;
  MM;
  CM;
}

parse
  :  measurement+ EOF -> ^(ROOT measurement+)
  ;

measurement
  :  Number MilliMeter -> ^(MM Number)
  |  Number CentiMeter -> ^(CM Number)
  ;

Number
  :  '0'..'9'+
  ;

MilliMeter
  :  'millimetre'
  |  'millimetres'
  |  'millimeter'
  |  'millimeters'
  |  'mm'
  ;

CentiMeter
  :  'centimetre'
  |  'centimetres'
  |  'centimeter'
  |  'centimeters'
  |  'cm'
  ;

Space
  :  (' ' | '\t' | '\r' | '\n'){$channel=HIDDEN;}
  ;

It can be tested with the following class:

import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;

public class Main {
    public static void main(String[] args) throws Exception {
        // Feed a sample input through the generated lexer and parser.
        ANTLRStringStream in = new ANTLRStringStream("12 millimeters 3 mm 456 cm");
        MeasurementLexer lexer = new MeasurementLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        MeasurementParser parser = new MeasurementParser(tokens);
        // Invoke the start rule and fetch the AST it built.
        MeasurementParser.parse_return returnValue = parser.parse();
        CommonTree tree = (CommonTree)returnValue.getTree();
        // Render the AST as a DOT graph description and print it.
        DOTTreeGenerator gen = new DOTTreeGenerator();
        StringTemplate st = gen.toDOT(tree);
        System.out.println(st);
    }
}

which produces the following DOT file:

digraph {

    ordering=out;
    ranksep=.4;
    bgcolor="lightgrey"; node [shape=box, fixedsize=false, fontsize=12, fontname="Helvetica-bold", fontcolor="blue"
        width=.25, height=.25, color="black", fillcolor="white", style="filled, solid, bold"];
    edge [arrowsize=.5, color="black", style="bold"]

  n0 [label="ROOT"];
  n1 [label="MM"];
  n1 [label="MM"];
  n2 [label="12"];
  n3 [label="MM"];
  n3 [label="MM"];
  n4 [label="3"];
  n5 [label="CM"];
  n5 [label="CM"];
  n6 [label="456"];

  n0 -> n1 // "ROOT" -> "MM"
  n1 -> n2 // "MM" -> "12"
  n0 -> n3 // "ROOT" -> "MM"
  n3 -> n4 // "MM" -> "3"
  n0 -> n5 // "ROOT" -> "CM"
  n5 -> n6 // "CM" -> "456"

}

which corresponds to the tree:

[image: AST with ROOT at the top and three children: MM -> 12, MM -> 3, CM -> 456]

(image created by http://graph.gafol.net/)
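
Once the tree has this shape, the unit is carried by the node label (MM or CM) and the original spelling no longer matters to whatever consumes it. As a rough illustration, here is a plain-Java sketch that deliberately does not use the ANTLR runtime: each child of ROOT is modelled as a {label, number} pair of strings, and the class and method names (MeasurementTotals, toMillimetres, totalMillimetres) are made up for this example.

```java
final class MeasurementTotals {

    // Converts one (label, number) pair to millimetres, looking only at the
    // node label. The labels "MM" and "CM" mirror the imaginary tokens in the
    // grammar; this method is a hypothetical helper, not generated by ANTLR.
    static long toMillimetres(String label, String number) {
        long value = Long.parseLong(number);
        switch (label) {
            case "MM":
                return value;
            case "CM":
                return value * 10;
            default:
                throw new IllegalArgumentException("unknown unit label: " + label);
        }
    }

    // Sums all children of ROOT, each given as a {label, number} pair.
    static long totalMillimetres(String[][] children) {
        long total = 0;
        for (String[] child : children) {
            total += toMillimetres(child[0], child[1]);
        }
        return total;
    }

    public static void main(String[] args) {
        // Same shape as the tree above: (ROOT (MM 12) (MM 3) (CM 456))
        String[][] children = { {"MM", "12"}, {"MM", "3"}, {"CM", "456"} };
        System.out.println(totalMillimetres(children)); // prints 4575
    }
}
```

With a real CommonTree you would read the label and number from each child node instead of from a string array, but the dispatch on the node type would look much the same.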

EDIT

Note that the following:

measurement
  :  Number m=MilliMeter {System.out.println($m.getType() == MeasurementParser.MilliMeter);}
  |  Number CentiMeter
  ;

will always print true, regardless of whether the "contents" of the (millimeter) token is mm, millimetre, millimetres, ...

Bart Kiers 2010-09-29 11:08:40

Thanks for your response, Bart. I was aware of this possibility. The difference is that I'm trying to solve the problem at the lexical level, whereas you propose a syntactic rule. Your way is presumably the correct Antlr way. My experience with this problem is that rewrite rules only work with syntactic rules, and not with lexical rules. I'm solving the problem in my solution at the moment by post-processing the results in my Java code, but I should perhaps reconsider what I do in Antlr at the lexical level and what I do at the syntactic level.

Stephen Winnall 2010-09-30 14:09:33

@Stephen, ah okay, I see what you mean. But in my example, the type (for millimeter) will always be `MilliMeter` (see my **EDIT**). So I'm not entirely sure what you're after.

Bart Kiers 2010-09-30 14:20:19

You made me think, Bart. I was approaching the problem the wrong way. I was trying to do effectively bottom-up recognition by making the lexical analysis context-sensitive. This meant that I quickly reached the limits of what Antlr could do, since it is a top-down tool. I've shifted a lot of the analysis into the syntax now (like in your example), and everything's becoming easier. I think one has to be very aware of the difference between lexical rules and syntactical rules in Antlr, even if they look very similar. Not everything that syntactic rules can do is possible with lexical ones.

Stephen Winnall 2010-10-01 15:12:34

@Stephen, yeah, very true. It can be quite tricky to decide what to put in the lexer and what in the parser, especially when the language becomes more complex. Best of luck!

Bart Kiers 2010-10-01 15:44:02

Answer 2

A:

Note that fragment rules only "live" inside the lexer and cease to exist in the parser. For example:

grammar Measurement;

options {
  output=AST;
}

parse
  :  (m=MEASUREMENT {
       String contents = $m.text;
       boolean isMeasurementType = $m.getType() == MeasurementParser.MEASUREMENT;
       System.out.println("contents="+contents+", isMeasurementType="+isMeasurementType);
     })+ EOF
  ;

MEASUREMENT
  :  MILLIMETRE
  ;

fragment
MILLIMETRE
  :  'millimetre' 
  |  'millimetres'
  |  'millimeter' 
  |  'millimeters'
  |  'mm'
  ;

SPACE
  :  (' ' | '\t' | '\r' | '\n'){$channel=HIDDEN;}
  ;

with input text:

"millimeters mm"

will print:

contents=millimeters, isMeasurementType=true
contents=mm, isMeasurementType=true

In other words: no token of type MILLIMETRE ever reaches the parser; they're all of type MEASUREMENT.
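
Since every variant arrives as a MEASUREMENT token, the only place the spelling survives is the token text ($m.text above). If later Java code needs one canonical unit per token, one option is a small lookup on that text. The following is only a sketch: UnitNormalizer and canonicalUnit are hypothetical names, not part of ANTLR, and the choice of "mm" as the canonical form is an assumption.

```java
import java.util.Map;

final class UnitNormalizer {

    // The spelling variants listed in the MILLIMETRE fragment rule, each
    // mapped to one canonical form. "mm" as the target is an assumption
    // made for this illustration.
    private static final Map<String, String> CANONICAL = Map.of(
            "millimetre", "mm",
            "millimetres", "mm",
            "millimeter", "mm",
            "millimeters", "mm",
            "mm", "mm");

    // Returns the canonical unit for a token's text, or throws if the text
    // is not one of the known variants. Matching is case-sensitive, like
    // the lexer rule.
    static String canonicalUnit(String tokenText) {
        String unit = CANONICAL.get(tokenText);
        if (unit == null) {
            throw new IllegalArgumentException("unknown unit text: " + tokenText);
        }
        return unit;
    }

    public static void main(String[] args) {
        System.out.println(canonicalUnit("millimeters")); // prints mm
        System.out.println(canonicalUnit("mm"));          // prints mm
    }
}
```

This is the kind of post-processing that keeps the lexer simple, at the cost of repeating the list of spellings outside the grammar.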

Bart Kiers 2010-09-30 14:34:57

Matching lexeme variants with Antlr3
