Hello folks,

I'm having a bit of trouble manually emitting a token with a lexer rule in ANTLR. I know that the emit() function needs to be used but there seems to be a distinct lack of documentation about this. Does anybody have a good example of how to do this?

The ANTLR book gives a good example of why you need this: parsing Python's nesting. If a line starts with more whitespace than the previous line, emit an INDENT token; if it starts with less, emit a DEDENT token. Unfortunately, the book glosses over the actual syntax that's required.
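To make the INDENT/DEDENT idea concrete, here is a minimal sketch in plain Python rather than ANTLR. It only models the general pattern: an emit() that appends to a queue and a next_token() that drains it, so that one input line can yield several tokens. The class name IndentLexer, the NAME token type, and the method names are made up for illustration and are not ANTLR's API; inconsistent dedents are silently tolerated.

```python
from collections import deque


class IndentLexer:
    """Toy lexer: one NAME token per non-blank line, plus INDENT/DEDENT."""

    def __init__(self, text):
        self._lines = iter(text.splitlines())
        self._pending = deque()  # tokens emitted but not yet handed out
        self._indents = [0]      # stack of enclosing indentation widths

    def emit(self, token):
        # Queue the token instead of returning it directly, so a single
        # line may produce more than one token.
        self._pending.append(token)

    def next_token(self):
        # Keep consuming lines until at least one token is queued.
        while not self._pending:
            line = next(self._lines, None)
            if line is None:
                # End of input: close every open block, then signal EOF.
                while len(self._indents) > 1:
                    self._indents.pop()
                    self.emit(("DEDENT", None))
                self.emit(("EOF", None))
            elif line.strip():
                self._lex_line(line)
        return self._pending.popleft()

    def _lex_line(self, line):
        width = len(line) - len(line.lstrip(" "))
        if width > self._indents[-1]:
            self._indents.append(width)
            self.emit(("INDENT", None))
        while width < self._indents[-1]:
            self._indents.pop()
            self.emit(("DEDENT", None))
        self.emit(("NAME", line.strip()))
```

Calling next_token() repeatedly on "a\n  b\n    c\nd" hands out two DEDENT tokens before the NAME token for d, which is exactly the case where a lexer rule needs to emit more than one token.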

EDIT: Here's an example of what I'm trying to parse. It's Markdown's nested blockquotes:

before blockquote

> text1
>
> > text2
>
> text3

outside blockquote

Now, my approach so far is essentially to count the > symbols on each line. For example, the above should emit (roughly) PARAGRAPH_START, CDATA, PARAGRAPH_END, BQUOTE_START, CDATA, BQUOTE_START, CDATA, BQUOTE_END, CDATA, BQUOTE_END, PARAGRAPH_START, CDATA, PARAGRAPH_END. The difficulty is the final BQUOTE_END, which I think should be an imaginary token emitted once a non-blockquote element is found (and the nesting level is >= 1).
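The counting approach can be sketched outside ANTLR to check that it gives the expected token order. The following is a rough Python sketch, not a Markdown implementation: it leaves out the PARAGRAPH_START/PARAGRAPH_END tokens, assumes blank lines do not change the nesting level, and the function name tokenize is made up for illustration.

```python
import re

# Matches the leading run of '>' markers, e.g. "> >" in "> > text2".
_PREFIX = re.compile(r"^(?:\s*>)*")


def tokenize(text):
    """Return a flat list of (type, value) tokens for nested blockquotes."""
    tokens = []
    depth = 0  # current blockquote nesting level
    for line in text.splitlines():
        if not line.strip():
            continue  # simplification: blank lines do not change nesting
        prefix = _PREFIX.match(line).group(0)
        new_depth = prefix.count(">")
        # Emit one imaginary token for every level opened or closed.
        while depth < new_depth:
            tokens.append(("BQUOTE_START", None))
            depth += 1
        while depth > new_depth:
            tokens.append(("BQUOTE_END", None))
            depth -= 1
        body = line[len(prefix):].strip()
        if body:
            tokens.append(("CDATA", body))
    # Close anything still open at the end of the input.
    tokens.extend([("BQUOTE_END", None)] * depth)
    return tokens
```

On the sample above, the final BQUOTE_END is produced when the line "outside blockquote" is seen at depth 0, i.e. it is emitted by the first non-blockquote line rather than by any character in the input.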

A: 

Well, if the token you want to emit is not defined by a lexer rule, then you'll need to add a tokens section like so:

tokens
{
    MYFAKETOKEN;
}

In your lexer you will still need a rule that tells the lexer when to produce this token. A common case is deciding whether some input is an integer, a range, or a real value.

NUMBERS_OR_RANGE
    : INT
        (   { LA(1) == '.' && LA(2) == '.' }? { _ttype = INT; }    // "1..": an INT followed by a range
        |   { LA(1) == '.' || LA(1) == 'e' || LA(1) == 'E' }? { _ttype = REAL; }
        )
    | PERIOD
        (   PERIOD { _ttype = RANGE; }    // ".."
        |   INT (( 'e' | 'E' ) ( '-' | '+' )? INT )? { _ttype = REAL; }    // ".5", ".5e-3"
        )
    ;

Here we first match an INT and then look ahead. If we find a double period, we know that the INT really is an int and not a real (the two periods belong to a range that follows), so we set the variable _ttype to INT. If instead the next character is a period or an 'e'/'E', we know it's a real.

In the second case we match a period first: if the next character is another period, we've got a range; otherwise we've got a real.

We could use the MYFAKETOKEN type we defined above to assign to _ttype if that was appropriate.
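The lookahead decision described above can be tried out in a few lines of Python. This is only a sketch of the decision logic, not generated ANTLR code: peek stands in for LA(k), classify returns the type that would be assigned to _ttype, and both names are made up for illustration. Unlike the rule above, it also accepts a bare integer with nothing after it.

```python
def peek(src, pos, k=1):
    """Return the k-th lookahead character (1-based), or '' past the end."""
    idx = pos + k - 1
    return src[idx] if idx < len(src) else ""


def classify(src):
    """Return the token type of the token that starts at the front of src."""
    pos = 0
    if peek(src, pos).isdigit():
        # Consume the leading digits (the INT part).
        while peek(src, pos).isdigit():
            pos += 1
        if peek(src, pos, 1) == "." and peek(src, pos, 2) == ".":
            return "INT"   # "1..5": the ".." belongs to a following RANGE
        if peek(src, pos) in (".", "e", "E"):
            return "REAL"  # "1.5" or "2e10"
        return "INT"       # assumption: a bare integer is still an INT
    if peek(src, pos) == ".":
        if peek(src, pos, 2) == ".":
            return "RANGE"  # ".."
        if peek(src, pos, 2).isdigit():
            return "REAL"   # ".5"
    raise ValueError("not a number or range: %r" % src)
```

For example, classify("1..5") returns "INT" while classify("1.5") returns "REAL", even though both inputs start with the same two characters.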

chollida
Thanks, this is pretty close to what I'm looking for... I have updated the question with a more concrete example, as requested. If you have any insights, that'd be much appreciated!
Scott
A: 

Okay, I did some research and found this: http://www.cforcoding.com/2010/01/markdown-and-introduction-to-parsing.html

I don't think ANTLR is really set up for this sort of task and trying to bend over backwards to do it isn't really worth it.

Scott