I am writing an ANTLR grammar to recognize HTML block-level elements within plain text. Here is a relevant snippet, limited to the div tag:

grammar Test;

blockElement
  : div
  ;

div
  : '<' D I V HTML_ATTRIBUTES? '>' (blockElement | TEXT)* '</' D I V '>'
  ;

D : ('d' | 'D') ;
I : ('i' | 'I') ;
V : ('v' | 'V') ;

HTML_ATTRIBUTES
  : WS (~( '<' | '\n' | '\r' | '"' | '>' ))+
  ;

TEXT
  : (. | '\r' | '\n')
  ;

fragment WS
  : (' ' | '\t')
  ;

The TEXT token is supposed to represent anything that is not a block-level element, such as plain text or inline tags (e.g. <b></b>). When I test it on nested block elements, like:

<div level_0><div level_1></div></div>

it parses them correctly. However, as soon as I add some random text, it throws a MismatchedTokenException(0!=0) right after consuming the first TEXT token, e.g. the capital T in:

<div level_0>This is some random text</div>

Any suggestions? Am I doing something conceptually wrong? I am using ANTLR v. 3.2 and doing the testing with ANTLRWorks v. 1.4.

Thank you

+1  A: 

I recommend not testing your grammar with ANTLRWorks: error messages are easily missed in the console, and it might therefore interpret your test input differently than you expect. Test it with a small class of your own instead, like this:

import org.antlr.runtime.*;

public class Main {
    public static void main(String[] args) throws Exception {
        ANTLRStringStream in = new ANTLRStringStream("<div level_0>This is some random text</div>");
        TestLexer lexer = new TestLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        TestParser parser = new TestParser(tokens);
        parser.blockElement(); // invoke the entry rule of the grammar above
    }
}

Now, the following rule is not correct:

TEXT
  :  (. | '\r' | '\n')
  ;

The . already matches both \r and \n, so it should be:

TEXT
  :  .
  ;

After changing that, you can generate the parser and lexer, compile all .java files and run the Main class:

java -cp antlr-3.2.jar org.antlr.Tool Test.g
javac -cp antlr-3.2.jar *.java
java -cp .:antlr-3.2.jar Main

which will produce the following error:

line 1:15 mismatched input 'i' expecting '</'

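because the 'i' in "This" is tokenized by the rule I : ('i' | 'I') ; rather than by TEXT: both rules match that single character, and the lexer gives precedence to the rule that is defined first.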
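To make the effect concrete, here is a toy model in plain Java. It is not ANTLR and does not use the generated TestLexer; it only assumes that, when several single-character rules match, the one listed first wins. The class name LexerPrioritySketch and the method tokenize are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntPredicate;

// Toy model only: this is NOT ANTLR, just "first listed rule wins" per character.
final class LexerPrioritySketch {

    // Rules in the same order as in the grammar; LinkedHashMap keeps that order.
    private static final Map<String, IntPredicate> RULES = new LinkedHashMap<>();
    static {
        RULES.put("D", c -> c == 'd' || c == 'D');
        RULES.put("I", c -> c == 'i' || c == 'I');
        RULES.put("V", c -> c == 'v' || c == 'V');
        RULES.put("TEXT", c -> true); // '.' matches any character
    }

    // Returns one "RULE('c')" entry per input character.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        for (char c : input.toCharArray()) {
            for (Map.Entry<String, IntPredicate> rule : RULES.entrySet()) {
                if (rule.getValue().test(c)) {
                    tokens.add(rule.getKey() + "('" + c + "')");
                    break; // first matching rule wins
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [TEXT('T'), TEXT('h'), I('i'), TEXT('s')]
        System.out.println(tokenize("This"));
    }
}
```

The third token is I, not TEXT, so the parser's (blockElement | TEXT)* loop stops there and expects '</'.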

There are more problems with your current approach:

  • HTML_ATTRIBUTES does too much: you should instead have separate ATTRIBUTE, = and VALUE rules, and move the plural (html attributes) into your parser;
  • as it stands, your attributes cannot contain < or >, which is incorrect (they can contain them, although it is not recommended).
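As a rough illustration of the name / = / value split, here is a plain-Java stand-in (not an ANTLR grammar). It assumes attribute values are always double-quoted, which real HTML does not require; the class name AttributeSketch and the method parseAttributes are hypothetical. The regular expression plays the role of a single attribute, and the loop plays the role of the plural that would live in the parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the grammar: splits attribute text into name/value pairs.
final class AttributeSketch {

    // One attribute: name, '=', double-quoted value. The value may contain '<' and '>'.
    // Assumption: values are always double-quoted.
    private static final Pattern ATTRIBUTE =
            Pattern.compile("([A-Za-z_][A-Za-z0-9_-]*)\\s*=\\s*\"([^\"]*)\"");

    // The "plural": collect every attribute found in the text, in order.
    static Map<String, String> parseAttributes(String text) {
        Map<String, String> attributes = new LinkedHashMap<>();
        Matcher m = ATTRIBUTE.matcher(text);
        while (m.find()) {
            attributes.put(m.group(1), m.group(2));
        }
        return attributes;
    }

    public static void main(String[] args) {
        // prints {class=note, title=a < b > c}
        System.out.println(parseAttributes(" class=\"note\" title=\"a < b > c\""));
    }
}
```

Because the value is delimited by its quotes rather than by the absence of < and >, angle brackets inside it no longer end the attribute.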

I'd start over if I were you. If you want, I'm willing to propose a start: just say so.

Bart Kiers
Thanks, it seems I have misunderstood some of the fundamentals when it comes to priority rules. Back to the reference I go! Also, thank you for your offer, but I guess I need to cover the basics a bit better before starting over.
ASV
@ASV, sure, no problem.
Bart Kiers