I am writing an ANTLR grammar to recognize HTML block-level elements within plain text. Here is a relevant snippet, limited to the div tag:
grammar Test;

blockElement
    : div
    ;

div
    : '<' D I V HTML_ATTRIBUTES? '>' (blockElement | TEXT)* '</' D I V '>'
    ;

D : ('d' | 'D') ;
I : ('i' | 'I') ;
V : ('v' | 'V') ;

HTML_ATTRIBUTES
    : WS (~( '<' | '\n' | '\r' | '"' | '>' ))+
    ;

TEXT
    : (. | '\r' | '\n')
    ;

fragment WS
    : (' ' | '\t')
    ;
The TEXT token is supposed to represent anything that is not a block-level element, such as plain text or inline tags (e.g. <b></b>). When I test it on nested block elements, like:
<div level_0><div level_1></div></div>
it parses them correctly. However, as soon as I add some random text, it throws a MismatchedTokenException(0!=0) right after consuming the first TEXT token, e.g. the capital T in:
<div level_0>This is some random text</div>
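To check what the lexer might be handing to the parser for this input, here is a small stand-alone Java sketch. It is not ANTLR-generated code and may not match what ANTLR 3.2 really does. It is a hand-written approximation that assumes the lexer tokenizes without any parser context, prefers the longest match, and lets the rule defined first win a tie. The class name LexerSketch and its methods are hypothetical names made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written approximation of the lexer rules in the grammar above.
// Assumption: longest match wins, and on a tie the earlier rule wins.
class LexerSketch {

    // Length matched by HTML_ATTRIBUTES at pos, or 0 if the rule does not match.
    static int matchAttributes(String s, int pos) {
        char c = s.charAt(pos);
        if (c != ' ' && c != '\t') {
            return 0; // the rule must start with WS
        }
        int end = pos + 1;
        // consume everything except '<', '\n', '\r', '"' and '>'
        while (end < s.length() && "<\n\r\">".indexOf(s.charAt(end)) < 0) {
            end++;
        }
        return end > pos + 1 ? end - pos : 0; // needs at least one char after WS
    }

    // Returns the token type names for the input, in order.
    static List<String> tokenize(String s) {
        List<String> types = new ArrayList<>();
        int pos = 0;
        while (pos < s.length()) {
            char c = s.charAt(pos);
            int attrs = matchAttributes(s, pos);
            String type;
            int len = 1;
            if (s.startsWith("</", pos)) { type = "'</'"; len = 2; }
            else if (c == '<') { type = "'<'"; }
            else if (c == '>') { type = "'>'"; }
            else if (attrs > 0) { type = "HTML_ATTRIBUTES"; len = attrs; }
            else if (c == 'd' || c == 'D') { type = "D"; }
            else if (c == 'i' || c == 'I') { type = "I"; }
            else if (c == 'v' || c == 'V') { type = "V"; }
            else { type = "TEXT"; } // fallback: any other single character
            types.add(type);
            pos += len;
        }
        return types;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("<div level_0>This is some random text</div>"));
    }
}
```

Under these assumptions, the word This does not come out as four TEXT tokens: the i is classified as I, and the rest of the sentence, starting at the first space, is swallowed by HTML_ATTRIBUTES. If the real lexer behaves similarly, the parser would never see the run of TEXT tokens that the div rule expects, but I have not confirmed that this is what actually happens in ANTLR.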
Any suggestions? Am I doing something conceptually wrong? I am using ANTLR v3.2 and testing with ANTLRWorks v1.4.
Thank you