I'm reading about compiler and parser architecture now, and I wonder about one thing... When you have XML, XHTML, HTML or any SGML-based language, what would be the role of a lexer here, and what would the tokens be?
I've read that tokens are like words prepared for parsing by the lexer. While I have no problem finding tokens for languages like C, C++ or Pascal, where there are keywords, names, literals and other word-like strings separated by whitespace, with XML I have a problem, because there aren't any words! There's only plain text interleaved with the markup (tags).
I thought to myself that these tags and the plain-text fragments between them could be the tokens, something like this: `[TXT][TAG][TAG][TXT][TAG][TXT][TAG][TAG][TXT]`... That would be quite reasonable, since SGML doesn't care what's inside the markup delimiters `<` and `>` (well, it recognizes special processing instructions and declarations when it finds `?` or `!` as the next character, and comments belong to that group too), so an SGML tokenizer could serve as the base for an XML/HTML/XHTML parser.
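To make that idea concrete, here is a minimal sketch of such a tokenizer in Python (the name `tokenize` and the token tuples are made up for illustration, and it knows nothing about `?`, `!` or comments):

```python
def tokenize(source: str):
    """Naive scan: a TAG token for every <...> span, TXT for the rest."""
    tokens = []
    i = 0
    while i < len(source):
        if source[i] == "<":
            end = source.find(">", i)      # assume the first '>' closes the tag
            if end == -1:
                raise ValueError("unterminated tag")
            tokens.append(("TAG", source[i:end + 1]))
            i = end + 1
        else:
            end = source.find("<", i)      # text runs until the next '<'
            if end == -1:
                end = len(source)
            tokens.append(("TXT", source[i:end]))
            i = end
    return tokens

print(tokenize("<p>Hello <b>world</b>!</p>"))
# [('TAG', '<p>'), ('TXT', 'Hello '), ('TAG', '<b>'), ('TXT', 'world'),
#  ('TAG', '</b>'), ('TXT', '!'), ('TAG', '</p>')]
```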
But then I realized that `<` characters can be stuffed inside the markup as part of other syntax: attribute values :-/ Even though putting raw `<` characters inside attribute values is not a good idea (it's better to use `&lt;` for that), many browsers and editors deal with it and treat these `<` characters as part of the attribute value, not as tag delimiters.
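The same quote-blindness bites from the other direction too: feeding the sketch above a tag whose quoted value contains a `>` (which XML even allows) makes it close the tag too early:

```python
# Uses tokenize() from the sketch above on a hypothetical input.
print(tokenize('<a title="a > b">link</a>'))
# [('TAG', '<a title="a >'), ('TXT', ' b">link'), ('TAG', '</a>')]
```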
It complicates things a bit, because I don't see a way to recognize markup like that with a simple deterministic finite automaton (DFA) in the lexer. It looks like the automaton needs a separate context when it's inside a tag, and yet another context when it encounters an attribute value. I think this would require a stack of states/contexts, so a DFA might not be able to handle it. Am I right?
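Here is roughly what I imagine such a context-switching tokenizer would look like, as a hand-rolled sketch with an explicit mode variable (made-up names, not any real lexer generator's output):

```python
def tokenize_with_modes(source: str):
    tokens = []
    mode = "TEXT"     # current lexer context: TEXT, TAG or ATTR_VALUE
    quote = None      # the quote character that opened the current value
    start = 0
    for i, ch in enumerate(source):
        if mode == "TEXT":
            if ch == "<":
                if i > start:
                    tokens.append(("TXT", source[start:i]))
                mode, start = "TAG", i
        elif mode == "TAG":
            if ch in "\"'":
                mode, quote = "ATTR_VALUE", ch
            elif ch == ">":
                tokens.append(("TAG", source[start:i + 1]))
                mode, start = "TEXT", i + 1
        else:  # ATTR_VALUE: '<' and '>' are plain data in this context
            if ch == quote:
                mode = "TAG"
    if mode == "TEXT" and start < len(source):
        tokens.append(("TXT", source[start:]))
    return tokens

print(tokenize_with_modes('<a title="a > b">link</a>'))
# [('TAG', '<a title="a > b">'), ('TXT', 'link'), ('TAG', '</a>')]
```

Interestingly, three fixed modes seem to suffice here and no stack is needed, because quoted values can't nest; but I'm not sure whether such a mode-switching scanner still counts as a single DFA, hence my question.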
What's your view? Is it good to make tokens from tags (markup) and plain text?
Here: http://www.antlr.org/wiki/display/ANTLR3/Parsing+XML a somewhat different technique is used: they treat `<` and `>` (and also `</` and `/>`) as separate tokens, and inside tags they use `GENERIC_ID` as a token, etc. They generally move most of the work to the parser. But they also have to change contexts in the tokenizer: they use one context in plain text and a different one in markup (though I think they forgot about an attribute-value context, because the first occurrence of `>` will end the tag in their lexer).
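To make that last concern concrete, here is a toy regex imitation of that token granularity, deliberately without an attribute-value context (the token names follow the wiki only loosely; the patterns are my own guesses, not ANTLR output):

```python
import re

TOKEN_SPEC = [
    ("TAG_START_CLOSE", r"</"),
    ("TAG_EMPTY_CLOSE", r"/>"),
    ("TAG_START",       r"<"),
    ("TAG_CLOSE",       r">"),
    ("EQUALS",          r"="),
    ("GENERIC_ID",      r"[A-Za-z_][A-Za-z0-9_.:-]*"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def lex_markup(source: str):
    """Lex a fragment as if the lexer were already in its markup context.

    Characters matching no rule (quotes, whitespace) are silently skipped
    by finditer, which is itself a simplification.
    """
    return [(m.lastgroup, m.group()) for m in MASTER.finditer(source)]

print(lex_markup('<book title="a > b">'))
# [('TAG_START', '<'), ('GENERIC_ID', 'book'), ('GENERIC_ID', 'title'),
#  ('EQUALS', '='), ('GENERIC_ID', 'a'), ('TAG_CLOSE', '>'),
#  ('GENERIC_ID', 'b'), ('TAG_CLOSE', '>')]
```

The `>` inside the quoted value comes out as `TAG_CLOSE`, which is exactly the "first occurrence of `>` ends the tag" problem.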
So what's the best approach for parsing SGML-like languages? Is a lexer really used there? If so, which strings constitute the tokens?