I have what I think is a simple ANTLR question. I have two token types: ident and special_ident. I want my special_ident to match a single letter followed by a single digit. I want the generic ident to match a single letter, optionally followed by any number of letters or digits. My (incorrect) grammar is below:

expr 
    : special_ident
    | ident
    ;

special_ident : LETTER DIGIT;
ident         : LETTER (LETTER | DIGIT)*;

LETTER : 'A'..'Z';
DIGIT  : '0'..'9';

When I try to check this grammar, I get this warning:

Decision can match input such as "LETTER DIGIT" using multiple alternatives: 1, 2. As a result, alternative(s) 2 were disabled for that input

I understand that my grammar is ambiguous and that input such as A1 could match either ident or special_ident. I really just want the special_ident to be used in the narrowest of cases.

Here's some sample input and what I'd like it to match:

A      : ident
A1     : special_ident
A1A    : ident
A12    : ident
AA1    : ident

How can I form my grammar such that I correctly identify my two types of identifiers?

+3  A: 

Seems that you have 3 cases:

  • A
  • AN
  • A(A|N)(A|N)+

You could classify the middle one as special_ident and the other two as ident; seems that should do the trick.

I'm a bit rusty with ANTLR, so I hope this hint is enough. I can try to write out the expressions for you, but they could be wrong:

long_ident    : LETTER (LETTER | DIGIT) (LETTER | DIGIT)+;
special_ident : LETTER DIGIT;
ident         : LETTER | long_ident;
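
As a quick sanity check of the three cases above, here is a minimal Python sketch (not ANTLR) that models each case as a regular expression and classifies the sample inputs from the question. The `classify` helper and the pattern names are made up for illustration, and the regexes only model which strings each rule accepts, not how ANTLR chooses between alternatives.

```python
import re

# Hypothetical regex models of the three cases (A = letter, N = digit).
SINGLE_LETTER = re.compile(r"[A-Z]")                  # A
SPECIAL_IDENT = re.compile(r"[A-Z][0-9]")             # AN
LONG_IDENT = re.compile(r"[A-Z][A-Z0-9][A-Z0-9]+")    # A(A|N)(A|N)+


def classify(text):
    """Return the rule name that accepts the whole text, or None."""
    if SPECIAL_IDENT.fullmatch(text):
        return "special_ident"
    if SINGLE_LETTER.fullmatch(text) or LONG_IDENT.fullmatch(text):
        return "ident"
    return None


# The five samples from the question, plus a two-letter input.
for sample in ["A", "A1", "A1A", "A12", "AA1", "AA"]:
    print(sample, "->", classify(sample))
```

The five samples from the question come out as requested, but a two-character input made of two letters, such as AA, matches none of the three cases, so it would need to be covered separately.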
Carl Smotricz
+2  A: 

Expanding on Carl's thought, I would guess you have four different cases:

  1. A
  2. AN
  3. AA(A|N)*
  4. AN(A|N)+

Only case 2 should be a special_ident; the other three should be ident. All of them can be distinguished by syntax alone. Here is a quick grammar that I tested in ANTLRWorks, and it appeared to work properly for me. I think Carl's version might have one bug when checking AA, but getting you 99% of the way there is a huge benefit, so this is only a minor modification of his idea.

prog 
    :    (expr WS)+ EOF;

expr 
    : special_ident {System.out.println("Found special_ident:" + $special_ident.text + "\n");}
    | ident {System.out.println("Found ident:" + $ident.text + "\n");}
    ;

special_ident : LETTER DIGIT;

ident
    : LETTER
    | LETTER DIGIT (LETTER | DIGIT)+
    | LETTER LETTER (LETTER | DIGIT)*
    ;

LETTER : 'A'..'Z';
DIGIT  : '0'..'9';
WS 
    :   (' '|'\t'|'\n'|'\r')+;
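
For anyone who wants to double-check the four cases without firing up ANTLRWorks, here is a rough Python sketch that brute-forces every short string over a small alphabet and confirms that each identifier-shaped string matches exactly one case. The names `CASES`, `matching_rules`, and `check_partition` are hypothetical helpers, and the regexes are only a model of the language each alternative accepts, not of ANTLR's prediction.

```python
import re
from itertools import product

# The four cases from the list above, as (pattern, rule name) pairs.
CASES = [
    (re.compile(r"[A-Z]"), "ident"),                  # 1. A
    (re.compile(r"[A-Z][0-9]"), "special_ident"),     # 2. AN
    (re.compile(r"[A-Z][A-Z][A-Z0-9]*"), "ident"),    # 3. AA(A|N)*
    (re.compile(r"[A-Z][0-9][A-Z0-9]+"), "ident"),    # 4. AN(A|N)+
]

# Anything shaped like LETTER (LETTER | DIGIT)*.
ANY_IDENT = re.compile(r"[A-Z][A-Z0-9]*")


def matching_rules(text):
    """List the rule name of every case that accepts the whole text."""
    return [name for pattern, name in CASES if pattern.fullmatch(text)]


def check_partition(alphabet="AB01", max_len=4):
    """Return identifier-shaped strings that match zero or several cases."""
    bad = []
    for length in range(1, max_len + 1):
        for chars in product(alphabet, repeat=length):
            text = "".join(chars)
            if ANY_IDENT.fullmatch(text) and len(matching_rules(text)) != 1:
                bad.append(text)
    return bad


print(check_partition())  # an empty list means no gaps and no overlaps
```

The small alphabet and the length limit are arbitrary choices to keep the search tiny; since the cases only look at whether each character is a letter or a digit, two letters and two digits should be enough to exercise every branch.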
WayneH
Thanks... I think this is all making more sense. Is the last option in `ident` redundant? Wouldn't `LETTER LETTER` make the whole rule be equivalent? Also, would it be equivalent for the entire rule to say `LETTER LETTER? | LETTER DIGIT (LETTER|DIGIT)+`?
Chris Farmer
There are several different ways you can write the rules (I think). I was just making sure that LETTER DIGIT has another letter or digit after it, to separate it from the special_ident rule. The LETTER LETTER option does not require any more tokens after it. That is why one has a plus sign and the other has an asterisk.
WayneH
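
To make the plus-versus-asterisk point concrete, here is a small Python sketch that models the `ident` rule from the answer and the shorter variant from the comment as regular expressions and compares them on a few inputs. The names `ANSWER_IDENT`, `COMMENT_IDENT`, and `accepts` are made up for this illustration, and regexes only approximate which strings each rule accepts.

```python
import re

# ident from the answer:
#   LETTER | LETTER DIGIT (LETTER|DIGIT)+ | LETTER LETTER (LETTER|DIGIT)*
ANSWER_IDENT = re.compile(r"[A-Z]|[A-Z][0-9][A-Z0-9]+|[A-Z][A-Z][A-Z0-9]*")

# Variant from the comment:
#   LETTER LETTER? | LETTER DIGIT (LETTER|DIGIT)+
COMMENT_IDENT = re.compile(r"[A-Z][A-Z]?|[A-Z][0-9][A-Z0-9]+")


def accepts(pattern, text):
    """True if the pattern matches the whole text."""
    return pattern.fullmatch(text) is not None


for sample in ["A", "AA", "A1", "A1A", "AA1", "AAA"]:
    print(sample, accepts(ANSWER_IDENT, sample), accepts(COMMENT_IDENT, sample))
```

Under this model the two rules agree on A, AA, A1, and A1A, but AA1 and AAA are accepted only by the version from the answer, because `LETTER LETTER?` stops after at most two letters.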