I have a relatively complicated lexer problem. Given the following input:

-argument -argument#with hashed data# #plainhashedData#

I need these tokens:

ARGUMENT (Text = "argument")
ARGUMENT (Text = "argument")
EXTRADATA (Text = "with hashed data")
OTHER (Text = "#plainhasheddata#")

I've been able to take care of the text manipulation problems, but I need some way to specify that the EXTRADATA rule can only match when the previously matched rule was ARGUMENT. ANTLR supports syntactic predicates (even in lexers), so this should not be difficult to achieve -- but I need to be able to find out what the previously matched token was before I can write such a predicate.

Is this possible using the ANTLR C code generation target?

Billy3

EDIT: The current lexer rules look something like:

ARGUMENT : '-'+ (~('-'|'#'|' '))+
         ;
EXTRADATA : '#' (~'#')* '#'
          ;
OTHER : ~'-' (~' ')*
      ;
A:

Note: I know little C and have no experience with ANTLR's C runtime, but the Java code in my examples should not be too hard to rewrite in C.


You could do that by overriding the emit(Token) method from the base Lexer class and keeping track of the last Token your lexer processes:

private Token last;

@Override
public void emit(Token token) {
  last = token;
  super.emit(token);
}

To include this in your lexer, add it to your grammar inside the following section:

@lexer::members {

  // your code here

}

Now put the Other rule before the ExtraData rule, and start the Other rule with a gated semantic predicate that checks whether the last token was an ExtraData token:

Other
  :  {behind(ExtraData)}?=> ~'-' (~' ')*
  ;

where the behind(int) method is a custom method in your @lexer::members { ... } section:

protected boolean behind(int tokenType) {
  return last != null && last.getType() == tokenType;
}

which will cause the Other token to be matched only if the last token was an ExtraData.
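The emit-and-remember idea does not depend on ANTLR itself. As a minimal sketch, using hypothetical stand-ins (BaseLexer, TrackingLexer, Tok and the token type constants are invented here and are not ANTLR classes), the following plain-Java program shows how the predicate's answer changes as tokens are emitted:

```java
import java.util.ArrayList;
import java.util.List;

public class EmitTrackingSketch {
    // Hypothetical token type codes; ANTLR generates its own constants.
    static final int ARGUMENT = 1;
    static final int EXTRA_DATA = 2;

    // Stand-in for a token: just a type and its text.
    static final class Tok {
        final int type;
        final String text;
        Tok(int type, String text) { this.type = type; this.text = text; }
    }

    // Stand-in for a runtime base lexer that passes every token to emit().
    static class BaseLexer {
        final List<Tok> emitted = new ArrayList<>();
        void emit(Tok token) { emitted.add(token); }
    }

    // The subclass records the last emitted token before delegating.
    static class TrackingLexer extends BaseLexer {
        private Tok last;

        @Override
        void emit(Tok token) {
            last = token;
            super.emit(token);
        }

        // True only if a token was emitted and it has the given type.
        boolean behind(int tokenType) {
            return last != null && last.type == tokenType;
        }
    }

    public static void main(String[] args) {
        TrackingLexer lexer = new TrackingLexer();
        System.out.println(lexer.behind(EXTRA_DATA)); // false: nothing emitted yet
        lexer.emit(new Tok(ARGUMENT, "-argument"));
        System.out.println(lexer.behind(EXTRA_DATA)); // false: last token is an Argument
        lexer.emit(new Tok(EXTRA_DATA, "#with hashed data#"));
        System.out.println(lexer.behind(EXTRA_DATA)); // true: last token is an ExtraData
    }
}
```

The null check matters for the very first token, when nothing has been emitted yet.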

A little demo-grammar of it all:

grammar LookBehind;

@lexer::members {

  private Token last;

  @Override
  public void emit(Token token) {
    last = token;
    super.emit(token);
  }

  protected boolean behind(int tokenType) {
    return last != null && last.getType() == tokenType;
  }
}

parse
  :  token+ EOF
  ;

token
  :  Argument  {System.out.println("Argument  :: "+$Argument.text);}
  |  Other     {System.out.println("Other     :: "+$Other.text);}
  |  ExtraData {System.out.println("ExtraData :: "+$ExtraData.text);}
  ;

Argument
  :  '-'+ (~('-' | '#' | ' '))+
  ;

Other
  :  {behind(ExtraData)}?=> ~('-' | ' ') (~' ')*
  ;

ExtraData 
  : '#' (~'#')* '#'
  ;

Space
  :  (' ' | '\t' | '\r' | '\n') {skip();}
  ;

and a main-class to test it:

import org.antlr.runtime.*;

public class Main {
    public static void main(String[] args) throws Exception {
        String source = "-argument -argument#with hashed data# #plainhashedData#";
        ANTLRStringStream in = new ANTLRStringStream(source);
        LookBehindLexer lexer = new LookBehindLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        LookBehindParser parser = new LookBehindParser(tokens);
        parser.parse();
    }
}

First generate a parser and lexer from the grammar:

java -cp antlr-3.2.jar org.antlr.Tool LookBehind.g 

then compile all .java files:

javac -cp antlr-3.2.jar *.java

and finally run the main class:

java -cp .:antlr-3.2.jar Main

(on Windows do: java -cp .;antlr-3.2.jar Main)

which will then produce the following output:

Argument  :: -argument
Argument  :: -argument
ExtraData :: #with hashed data#
Other     :: #plainhashedData#

EDIT

As you (Billy) mentioned in your comment, you can't override methods in C. Instead, you could set a boolean flag in the @after{ ... } clause of each lexer rule to record whether the last token was an ExtraData, and use that flag in your predicate:

grammar LookBehind;

@lexer::members {
  private boolean lastExtraData = false;
}

parse
  :  token+ EOF
  ;

token
  :  Argument  {System.out.println("Argument  :: "+$Argument.text);}
  |  Other     {System.out.println("Other     :: "+$Other.text);}
  |  ExtraData {System.out.println("ExtraData :: "+$ExtraData.text);}
  ;

Argument
@after{lastExtraData = false;}
  :  '-'+ (~('-' | '#' | ' '))+
  ;

Other
@after{lastExtraData = false;}
  :  {lastExtraData}?=> ~('-' | ' ') (~' ')*
  ;

ExtraData
@after{lastExtraData = true;}
  : '#' (~'#')* '#'
  ;

Space
  :  (' ' | '\t' | '\r' | '\n') {skip();}
  ;

This is a bit of a hack, though: you'll have to set the flag in every lexer rule.
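To see the flag logic in isolation, here is a hand-written sketch in plain Java that mimics the four lexer rules above without ANTLR. The names FlagLexerSketch and tokenize are made up for illustration, and the character handling is a simplification of what a generated lexer does, so treat it as a model of the control flow rather than a replacement for the grammar:

```java
import java.util.ArrayList;
import java.util.List;

public class FlagLexerSketch {

    // Returns each token formatted as "Type :: text".
    public static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        boolean lastExtraData = false; // same role as the flag in the grammar
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            int start = i;
            if (c == ' ' || c == '\t' || c == '\r' || c == '\n') {
                i++; // Space: skipped, flag left untouched
            } else if (c == '-') {
                // Argument: one or more '-', then one or more chars other than '-', '#', ' '
                while (i < src.length() && src.charAt(i) == '-') i++;
                int body = i;
                while (i < src.length() && "-# ".indexOf(src.charAt(i)) < 0) i++;
                if (i == body) throw new IllegalArgumentException("empty argument at " + start);
                tokens.add("Argument  :: " + src.substring(start, i));
                lastExtraData = false;
            } else if (lastExtraData) {
                // Other: only tried when the previous token was an ExtraData
                while (i < src.length() && src.charAt(i) != ' ') i++;
                tokens.add("Other     :: " + src.substring(start, i));
                lastExtraData = false;
            } else if (c == '#') {
                // ExtraData: everything up to and including the closing '#'
                int close = src.indexOf('#', i + 1);
                if (close < 0) throw new IllegalArgumentException("unclosed '#' at " + start);
                i = close + 1;
                tokens.add("ExtraData :: " + src.substring(start, i));
                lastExtraData = true;
            } else {
                throw new IllegalArgumentException("unexpected '" + c + "' at " + start);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String source = "-argument -argument#with hashed data# #plainhashedData#";
        tokenize(source).forEach(System.out::println);
    }
}
```

Since the state is a single boolean updated in each branch, the same structure should carry over to C with little change.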

You might also post a question to the ANTLR mailing list: besides many ANTLR experts, the person who maintains ANTLR's C runtime frequents it.

Good luck!

Bart Kiers
Err.. that does become somewhat of a problem. How does one override the emit method in a language without classes? :P Thank you though.
Billy ONeal
~:), errrhm, yeah... I don't know exactly. Now you know I didn't lie about knowing little about C! :). I had thought something would have been possible...
Bart Kiers
@Billy, see my edit.
Bart Kiers
@Bart: I have posted to the mailing list, but will use the hack as a workaround if needed.
Billy ONeal