ansaurus

Question

How can my ANTLR lexer match a token made of characters that are a subset of another kind of token?

Answer 1

+3 A:

Seems that you have 3 cases:

A
AN
A(A|N)(A|N)+

You could classify the middle one as special_ident and the other two as ident; seems that should do the trick.

I'm a bit rusty with ANTLR, so I hope this hint is enough. I can try to write out the expressions for you, but they could be wrong:

long_ident    : LETTER (LETTER | DIGIT) (LETTER | DIGIT)+;
special_ident : LETTER DIGIT;
ident         : LETTER | long_ident;

Carl Smotricz 2010-01-31 21:44:27

Answer 2

+2 A:

Expanding on Carl's thought, I would guess you have four different cases:

A
AN
AA(A|N)*
AN(A|N)+

Only option 2 should be the token special_ident; the other three should be ident. All tokens can be identified by syntax alone. Here is a quick grammar that I tested in ANTLRWorks, and it appeared to work properly for me. I think Carl's might have one bug when trying to check AA, since his long_ident requires at least three characters, but getting you 99% there is a huge benefit, so this is only a minor modification to his quick thought.

prog 
    :    (expr WS)+ EOF;

expr 
    : special_ident {System.out.println("Found special_ident:" + $special_ident.text + "\n");}
    | ident {System.out.println("Found ident:" + $ident.text + "\n");}
    ;

special_ident : LETTER DIGIT;

ident         : LETTER 
    |LETTER DIGIT (LETTER|DIGIT)+
    |LETTER LETTER (LETTER|DIGIT)*;

LETTER : 'A'..'Z';
DIGIT  : '0'..'9';
WS 
    :   (' '|'\t'|'\n'|'\r')+;
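
As a quick cross-check of the four cases outside ANTLR, here is a minimal Python sketch that mirrors the special_ident and ident alternatives as regular expressions. It is not ANTLR and says nothing about how the generated parser behaves; the `classify` helper and the pattern names are made up for illustration, and the sample strings are just examples.

```python
import re

# Hypothetical regex mirrors of the parser rules above (not ANTLR itself).
SPECIAL_IDENT = re.compile(r"[A-Z][0-9]")     # LETTER DIGIT
IDENT = re.compile(
    r"[A-Z]"                      # LETTER
    r"|[A-Z][0-9][A-Z0-9]+"       # LETTER DIGIT (LETTER|DIGIT)+
    r"|[A-Z][A-Z][A-Z0-9]*"       # LETTER LETTER (LETTER|DIGIT)*
)

def classify(text):
    """Return the name of the rule that matches the whole text, or None."""
    if SPECIAL_IDENT.fullmatch(text):
        return "special_ident"
    if IDENT.fullmatch(text):
        return "ident"
    return None

# One sample per case, plus an input that starts with a digit.
for sample in ["A", "A1", "AA", "A1B", "AB12", "1A"]:
    print(sample, "->", classify(sample))
```

Under these assumptions only the two-character LETTER DIGIT form is reported as special_ident, and an input that starts with a digit matches neither rule.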

WayneH 2010-02-01 17:37:23

Thanks... I think this is all making more sense. Is the last option in `ident` redundant? Wouldn't `LETTER LETTER` make the whole rule equivalent? Also, would it be equivalent for the entire rule to say `LETTER LETTER? | LETTER DIGIT (LETTER|DIGIT)+`?

Chris Farmer 2010-02-01 19:20:00

There are several different ways you can write the rules (I think). I was just making sure that LETTER DIGIT has another letter or digit after it, to separate it from the special_ident rule. The LETTER LETTER option does not require any more tokens after it. That is why one has a plus sign and the other has an asterisk.
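
To see whether the shorter rule from the question above accepts the same strings, one option is to brute-force short inputs. This is a rough Python sketch, not ANTLR: the two patterns are regex stand-ins for the `ident` rule and for the proposed `LETTER LETTER? | LETTER DIGIT (LETTER|DIGIT)+`, and the `differences` helper and its parameters are invented for the example.

```python
import re
from itertools import product

# Regex stand-ins (assumptions, not ANTLR) for the two versions of ident.
ORIGINAL = re.compile(r"[A-Z]|[A-Z][0-9][A-Z0-9]+|[A-Z][A-Z][A-Z0-9]*")
PROPOSED = re.compile(r"[A-Z][A-Z]?|[A-Z][0-9][A-Z0-9]+")

def differences(max_len=3, alphabet="A1"):
    """List strings up to max_len that exactly one of the patterns accepts."""
    found = []
    for n in range(1, max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(ORIGINAL.fullmatch(s)) != bool(PROPOSED.fullmatch(s)):
                found.append(s)
    return found

print(differences())  # prints ['AAA', 'AA1']
```

With one letter and one digit as the alphabet, the two versions first disagree at length three: the original accepts AAA and AA1 through its `(LETTER|DIGIT)*` tail, while the shorter rule stops after two letters.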

WayneH 2010-02-01 23:26:54
