I am new to parser generators and I am wondering what an ANTLR grammar for an embedded language like JSP/ASP/PHP might look like. Unfortunately, the ANTLR site doesn't provide any such grammar files.

More precisely, I don't know how to define an AnyText token which matches everything (including keywords, which have no meaning outside the code blocks) while still recognizing those keywords correctly inside the blocks.

For example, the following snippet should be tokenized as something like: AnyText, BlockBegin, Keyword, BlockEnd, AnyText.

lorem ipsum KEYWORD dolor sit <% KEYWORD %> amet

Maybe there is another parser generator that is better suited to my needs. I have only tried ANTLR so far, because of its huge popularity here on Stack Overflow :)

Many thanks in advance!

+2  A: 

I can't speak for ANTLR, as I use a different lexer/parser: the DMS Software Reengineering Toolkit, for which I have developed precisely such JSP and PHP lexers/parsers. (ASP isn't different, as you observed in your question.)

But the basic idea is that the lexer needs lexical modes to distinguish when it is picking up "anytext" from when it is processing "real" programming language text. So you need a starting lexical mode, say HTML, whose job is to absorb the HTML text and, when it encounters a transition into PHP, switch modes. You also need a PHP mode which picks up all the PHP tokens, and switches back to HTML mode when the transition-out characters are encountered. Here's a sketch:

%%HTML -- mode
#token HTMLText "~[]* \< \% "
   << (GotoPHPMode) >>

%%PHP -- mode
#token KEYWORD "KEYWORD"
...
#token '%>'  "\%\>"
   << (GotoHTMLMode) >>

Your lexer generator is likely to have some kind of mode-switching capability that you'll have to use instead of this. And you'll likely find that lexing the HTML stuff is more complicated than it looks (you have to worry about <SCRIPT tags and lots of other crazy HTML stuff), but those are details I presume you can handle.
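
To make the mode idea concrete without tying it to any particular generator, here is a rough hand-written sketch in Python. It is not DMS or ANTLR syntax; the names tokenize, KEYWORDS, TEXT, and CODE are made up for illustration, and the only "language" it knows is the single keyword from the question. TEXT mode swallows everything up to the next "<%", and CODE mode recognizes keywords until it sees "%>".

```python
import re

# Hypothetical keyword set; the question only names KEYWORD.
KEYWORDS = {"KEYWORD"}

WHITESPACE = re.compile(r"\s*")
# Tokens that exist only in CODE mode: block end, a word, or any other symbol.
CODE_TOKEN = re.compile(r"(?P<end>%>)|(?P<word>[A-Za-z_]\w*)|(?P<other>\S)")


def tokenize(text):
    """Return (kind, value) pairs, switching between TEXT and CODE modes."""
    tokens = []
    pos = 0
    mode = "TEXT"
    while pos < len(text):
        if mode == "TEXT":
            # Absorb everything up to the next block opener as one AnyText token.
            start = text.find("<%", pos)
            end = len(text) if start == -1 else start
            if end > pos:
                tokens.append(("AnyText", text[pos:end]))
            if start == -1:
                break
            tokens.append(("BlockBegin", "<%"))
            pos = start + 2
            mode = "CODE"  # transition into the embedded language
        else:
            # Whitespace is insignificant inside a code block.
            pos = WHITESPACE.match(text, pos).end()
            if pos >= len(text):
                break
            m = CODE_TOKEN.match(text, pos)
            if m.lastgroup == "end":
                tokens.append(("BlockEnd", "%>"))
                mode = "TEXT"  # transition back out
            elif m.lastgroup == "word":
                kind = "Keyword" if m.group() in KEYWORDS else "Identifier"
                tokens.append((kind, m.group()))
            else:
                tokens.append(("Symbol", m.group()))
            pos = m.end()
    return tokens


for token in tokenize("lorem ipsum KEYWORD dolor sit <% KEYWORD %> amet"):
    print(token)
```

The first KEYWORD never reaches the keyword rule because the lexer is in TEXT mode at that point, so it ends up inside an AnyText token. The one between "<%" and "%>" is seen in CODE mode and comes out as a Keyword.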

Ira Baxter
Many thanks for your response. Mode switching might indeed be a solution, although it's still a bit problematic with ANTLR, because only the lexer should switch modes while the parser must stay the same. (Otherwise it would be hard to parse things like "<% for ... %>AnyText<% endfor %>".) The easiest solution I have found so far is to use boost::spirit. There, the lexer is called by the parser, so you can simply write as many rules including anychar_p's as you want, without switching modes.
tux21b
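
The concern in the comment above, that only the lexer should switch modes while one parser handles the whole token stream, can be illustrated with a small sketch. This is plain Python, not ANTLR or boost::spirit; the parse function, the token kinds, and the for/endfor handling are assumptions made up for this example. The token list is written out by hand, as a mode-switching lexer might produce it.

```python
def parse(tokens):
    """Parse a flat list of (kind, value) tokens into a nested tree.

    A single parser sees text tokens and code tokens alike, so a block
    that spans several <% ... %> sections is matched in the usual way.
    """
    pos = 0

    def parse_items(stop_word):
        nonlocal pos
        items = []
        while pos < len(tokens):
            kind, value = tokens[pos]
            if kind == "AnyText":
                items.append(("text", value))
                pos += 1
            elif kind == "BlockBegin":
                word = tokens[pos + 1][1]
                if word == stop_word:
                    return items  # the caller consumes the closing block
                if word != "for":
                    raise SyntaxError("unexpected " + word)
                # Collect the loop header up to the closing %>.
                pos += 2
                header = []
                while tokens[pos][0] != "BlockEnd":
                    header.append(tokens[pos][1])
                    pos += 1
                pos += 1
                body = parse_items("endfor")
                pos += 3  # skip <%, endfor, %>
                items.append(("for", header, body))
            else:
                raise SyntaxError("unexpected token " + value)
        if stop_word is not None:
            raise SyntaxError("missing " + stop_word)
        return items

    return parse_items(None)


# Hand-written tokens for: <% for x %>AnyText<% endfor %>
tokens = [
    ("BlockBegin", "<%"), ("Keyword", "for"), ("Identifier", "x"), ("BlockEnd", "%>"),
    ("AnyText", "AnyText"),
    ("BlockBegin", "<%"), ("Keyword", "endfor"), ("BlockEnd", "%>"),
]
print(parse(tokens))
```

Because the mode lives entirely in the lexer, the parser needs no notion of "inside" or "outside" a block beyond the BlockBegin and BlockEnd tokens themselves.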