I'm reading about compiler and parser architecture now and I wonder about one thing... When you have XML, XHTML, HTML or any SGML-based language, what would be the role of a lexer here, and what would the tokens be?

I've read that tokens are like words prepared for parsing by the lexer. While I have no problem finding tokens for languages like C, C++, Pascal etc., where there are keywords, names, literals and other word-like strings separated by whitespace, with XML I have a problem, because there aren't any words! There's only plain text interleaved with the markup (tags).

I thought to myself that these tags and plain-text fragments could be the tokens, something like this: [TXT][TAG][TAG][TXT][TAG][TXT][TAG][TAG][TXT].... It would be quite reasonable, since SGML doesn't care what's inside the markup delimiters < and > (well, it recognizes special processing instructions and definitions when it finds ? or ! as the next character; comments belong to that group too), and an SGML tokenizer could be a base for the XML/HTML/XHTML parser.

But then I realized that < characters can be stuffed inside the markup as part of other syntax: attribute values :-/ Even though it's not a good idea to put < characters inside attribute values (it's better to use &lt; instead), many browsers and editors deal with it and treat such < characters as part of the attribute value, not as a tag delimiter.

It complicates things a bit, because I don't see a way to recognize markup like that with a simple deterministic finite automaton (DFA) in the lexer. It looks like it requires a separate context for the automaton when it's inside a tag, and another context when it encounters an attribute value. This would need a stack of states/contexts, I think, so a DFA might not handle it. Am I right?

What's your view? Is it good to make tokens from tags (markup) and plain text?

Here: http://www.antlr.org/wiki/display/ANTLR3/Parsing+XML
a somewhat different technique is used: they treat < and > (and also </ and />) as separate tokens, and inside the tags they use GENERIC_ID as a token, etc. They generally move most of the work to the parser. But they also have to change contexts in the tokenizer: they use one context in the plain text and another in markup (but I think they forgot about an attribute-value context, because the first occurrence of > will end the tag in their lexer).

So what's the best approach for parsing SGML-like languages? Is the lexer really used there? If yes, what strings constitute the tokens?

+2  A: 

Having built XML and HTML parsers, I have opinions.

Lexemes in general should be recognizable language elements.

For XML and HTML, these correspond basically to

  • TAGBEGIN, things of the form of <NAME
  • TAGEND, of the form of >
  • TAGCLOSE, of the form of </NAME>
  • TAGENDANDCLOSE of the form /> (XML only)
  • ATTRIBUTENAME, of the form of NAME
  • EQUALSIGN, being precisely =
  • ATTRIBUTEVALUE, being the exact character string represented by an attribute value, regardless of quotes (or even the absence of quotes, for legacy HTML). If there are escaped character codes inside the attribute, those codes should be converted to their actual characters.
  • CONTENT, which is the text between TAGENDs and TAGBEGINs. Like ATTRIBUTEVALUEs, any escaped characters should be converted, so the CONTENT between <B> and </B> in <B>foo&lt;bar</B> is converted to the text foo<bar. If you want to keep the entity invocations as separate tokens, you could do that, producing streams of CONTENT and ENTITYINVOCATION tokens between TAGENDs and TAGBEGINs; it depends on what your goal is.
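The token inventory above, together with the entity conversion it calls for, could be sketched like this (a hypothetical Python sketch, not the answerer's actual code; `html.unescape` stands in for a hand-rolled entity decoder):

```python
from enum import Enum, auto
from html import unescape

# The lexeme inventory listed above, as a token-kind enumeration.
class Tok(Enum):
    TAGBEGIN = auto()        # <NAME
    TAGEND = auto()          # >
    TAGCLOSE = auto()        # </NAME>
    TAGENDANDCLOSE = auto()  # />  (XML only)
    ATTRIBUTENAME = auto()   # NAME
    EQUALSIGN = auto()       # =
    ATTRIBUTEVALUE = auto()  # quotes stripped, entities decoded
    CONTENT = auto()         # text between tags, entities decoded

# Entity references are decoded when the lexeme is produced,
# so the CONTENT of <B>foo&lt;bar</B> becomes the text foo<bar.
print(unescape("foo&lt;bar"))  # foo<bar
```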

We can argue about whether you want to produce a token for HTML/XML comments or not. If you do, you do.

If we ignore the complications of DTDs and Schemas for XML, that's all you really need.

How the lexer produces these is more complicated; with XML and HTML, there's a lot of messiness having to do with escapes in the input stream, and with <![CDATA[ ... ]]> sections, which are just a funny kind of quote and vanish when the CONTENT lexeme is produced. To handle all this, you need a pretty sophisticated lexer engine. And yes, as a practical matter, you need different lexical states ("modes") to process different parts of the text. I pretty much have one major mode to process things inside <...>, and one major mode to process CONTENT.
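That two-mode scheme might look roughly like this (a minimal, hypothetical Python sketch, not the engine described above): one mode for CONTENT, one for everything inside <...>. Because quoted attribute values are consumed by their own rule, a > inside quotes can never be seen as TAGEND.

```python
import re
from html import unescape

# Rules tried in order while in TAG mode (inside <...>).
TAG_RULES = [
    ("TAGCLOSE",       re.compile(r"</\s*([A-Za-z_][\w.-]*)\s*>")),
    ("TAGBEGIN",       re.compile(r"<([A-Za-z_][\w.-]*)")),
    ("TAGENDANDCLOSE", re.compile(r"/>")),
    ("TAGEND",         re.compile(r">")),
    ("EQUALSIGN",      re.compile(r"=")),
    ("ATTRIBUTEVALUE", re.compile(r'"([^"]*)"|\'([^\']*)\'')),
    ("ATTRIBUTENAME",  re.compile(r"[A-Za-z_][\w.-]*")),
    ("WS",             re.compile(r"\s+")),
]

def lex(text):
    mode, i, out = "CONTENT", 0, []
    n = len(text)
    while i < n:
        if mode == "CONTENT":
            # Everything up to the next '<' is one CONTENT lexeme,
            # with entity references decoded as it is produced.
            j = text.find("<", i)
            if j == -1:
                j = n
            if j > i:
                out.append(("CONTENT", unescape(text[i:j])))
            i, mode = j, "TAG"
        else:
            for kind, rx in TAG_RULES:
                m = rx.match(text, i)
                if m:
                    break
            else:
                raise SyntaxError(f"bad markup at offset {i}")
            i = m.end()
            if kind == "WS":
                continue
            if kind == "ATTRIBUTEVALUE":
                val = m.group(1) if m.group(1) is not None else m.group(2)
                out.append((kind, unescape(val)))
            elif kind in ("TAGBEGIN", "TAGCLOSE"):
                out.append((kind, m.group(1)))
            else:
                out.append((kind, m.group(0)))
            # A tiny higher-level state machine switches modes.
            if kind in ("TAGEND", "TAGENDANDCLOSE", "TAGCLOSE"):
                mode = "CONTENT"
    return out

print(lex('<B>foo&lt;bar</B>'))
# [('TAGBEGIN', 'B'), ('TAGEND', '>'), ('CONTENT', 'foo<bar'), ('TAGCLOSE', 'B')]
```

This sketch ignores comments, processing instructions, CDATA sections and unquoted legacy-HTML attribute values, but it shows the shape: per-mode token automata, plus one outer machine that switches modes.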

Ira Baxter
Thanks for the very quick and substantive reply :) Hmm... sounds reasonable. Three more auxiliary questions on that: 1. Does it mean that `>` won't be treated as `TAGEND` when it is encountered inside an `ATTRIBUTEVALUE` token? 2. How are those additional context states related to NFA/DFA theory? Are they treated as normal states, or as some higher-level states controlling the DFA's operation? 3. Wouldn't it be better to use one parser to parse SGML tags and spit them out as tokens for the XML/HTML/XHTML parser on top of it? Is that a sensible approach?
SasQ
@SasQ: 1. <FOO SIZE=">"> would produce the tokens TAGBEGIN (with value 'FOO'), ATTRIBUTENAME (with value 'SIZE'), EQUALSIGN, ATTRIBUTEVALUE (with value '>'), TAGEND. 2. The "context states" (I prefer to call them lexical modes) don't relate to NFA/DFA theory; you use that theory *inside* those states to build FSAs that recognize the tokens allowed in those states. We use a higher-level state machine to transition between lexical modes. 3. Sure, you can nest or chain parsers if you want; the lexical-mode trick is just a funny version of that. However, it's a very effective version.
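Point 1 can be illustrated mechanically (a hypothetical one-regex sketch, not anyone's real lexer): in a maximal-munch alternation, the quoted-value rule consumes the inner > before the bare-> rule ever sees it.

```python
import re

# The quoted attribute value is one lexeme, so the '>' inside the
# quotes is swallowed by the value rule and never read as TAGEND.
tag = '<FOO SIZE=">">'
tokens = re.findall(r'</?[A-Za-z][\w.-]*|"[^"]*"|=|/>|>|[A-Za-z][\w.-]*', tag)
print(tokens)  # ['<FOO', 'SIZE', '=', '">"', '>']
```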
Ira Baxter