views: 426

answers: 4

So, just as a fun project, I decided I'd write my own XML parser. No, not to parse a specific document, and no, not using an XML parser library. I mean writing code to parse out any XML document into a usable data structure. Just because I like the challenge. :-)

With that said, so far it's proved to be... interesting. It's not as easy to parse (especially when you start taking into account special characters, CDATA, empty tags, comments, etc.) as it initially looked.

Are there any well-documented XML parsing algorithms or explanations anywhere that anyone knows of? There are well-documented queue, stack, B-tree, and similar implementations everywhere, but I'm not sure I've ever seen a simple, well-documented XML parser algorithm...

I repeat: I am not looking for a pre-built parser library! I am looking for information on how to create my own parser library! Do not tell me "use expat" or "use SAX" or whatever. That's not what I'm asking for.

A: 

http://expat.sourceforge.net/

Expat is an XML parser library written in C. It is a stream-oriented parser in which an application registers handlers for things the parser might find in the XML document (like start tags). An introductory article on using Expat is available on xml.com.

Nick Brooks
Please read the question. This is exactly what I *don't* want. I want to *write something like expat*, not just *use expat*.
Keith Palmer
+4  A: 

ANTLR offers a tutorial on parsing XML. It breaks the process down into phases: lexing, parsing, tree parsing, etc. It looks pretty interesting.
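
To give a rough idea of what the lexing and parsing phases look like when written by hand, here is a minimal sketch in Python. It is not taken from the ANTLR tutorial. The names `lex`, `parse`, and `Node` are made up for illustration, and it assumes a heavily simplified XML subset: elements, double-quoted attributes, and text only (no comments, CDATA, entities, processing instructions, or DOCTYPE, and no `>` inside attribute values).

```python
import re
from dataclasses import dataclass, field

# Assumption: a simplified XML subset. One regex recognizes either a tag
# (open, close, or self-closing) or a run of text up to the next "<".
TOKEN_RE = re.compile(
    r'<(?P<close>/)?(?P<name>[A-Za-z_:][\w.:\-]*)(?P<attrs>[^<>]*?)(?P<empty>/)?>'
    r'|(?P<text>[^<]+)'
)
ATTR_RE = re.compile(r'([A-Za-z_:][\w.:\-]*)\s*=\s*"([^"]*)"')


@dataclass
class Node:
    name: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)  # Node objects or str


def lex(source):
    """Phase 1: turn the character stream into (kind, value, attrs) tokens."""
    pos = 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected input at offset {pos}")
        pos = m.end()
        if m.group("text") is not None:
            yield ("text", m.group("text"), None)
        elif m.group("close"):
            yield ("close", m.group("name"), None)
        else:
            attrs = dict(ATTR_RE.findall(m.group("attrs")))
            kind = "empty" if m.group("empty") else "open"
            yield (kind, m.group("name"), attrs)


def parse(source):
    """Phase 2: build a tree from the tokens, using a stack of open elements."""
    root = Node("#document")
    stack = [root]
    for kind, value, attrs in lex(source):
        if kind == "text":
            if value.strip():  # whitespace-only text is dropped for brevity
                stack[-1].children.append(value)
        elif kind == "close":
            if len(stack) == 1 or stack[-1].name != value:
                raise SyntaxError(f"unexpected closing tag </{value}>")
            stack.pop()
        else:
            node = Node(value, attrs)
            stack[-1].children.append(node)
            if kind == "open":  # self-closing tags never go on the stack
                stack.append(node)
    if len(stack) != 1:
        raise SyntaxError(f"unclosed tag <{stack[-1].name}>")
    return root
```

Calling `parse('<a x="1"><b>hi</b><c/></a>')` returns a `#document` node whose single child is the `a` element. The stack is what enforces proper nesting: a closing tag must match the element currently on top.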

Corbin March
+1 for suggesting a parser generator
kdgregory
A: 

I don't know if it would be "cheating" in your book, but you could try parsing your XML with a general-purpose parser generator like ANTLR. The result would be a list of tokens (if you just use the lexer) or a parse tree (if you include the parser), and you could then rebuild the parse tree almost 1:1 into an XML structure.

Maybe. I haven't thought about the ways in which XML might be different from "normal" ANTLR fodder like programming languages, and whether you would be able to define a suitable grammar.
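
One difference worth noting is that an XML lexer is context-dependent: what a character means depends on whether you are inside a comment, a CDATA section, a tag, or plain character data. Below is a minimal hand-written sketch in Python of a lexer-only pass that produces a flat token list. The names `tokens` and `decode_entities` are hypothetical, and the sketch makes simplifying assumptions: tags are returned as raw strings without attribute parsing, a `>` inside an attribute value is not handled, and DOCTYPE declarations and processing instructions are treated as ordinary tags.

```python
import re

# The five entities predefined by XML; anything else needs a DTD.
ENTITIES = {"lt": "<", "gt": ">", "amp": "&", "apos": "'", "quot": '"'}
ENTITY_RE = re.compile(r"&(#x[0-9A-Fa-f]+|#[0-9]+|[A-Za-z]+);")


def decode_entities(text):
    """Replace predefined entities and numeric character references."""
    def repl(m):
        ref = m.group(1)
        if ref.startswith("#x"):
            return chr(int(ref[2:], 16))
        if ref.startswith("#"):
            return chr(int(ref[1:]))
        return ENTITIES[ref]  # raises KeyError for an undefined entity
    return ENTITY_RE.sub(repl, text)


def tokens(source):
    """Yield (kind, value) pairs; the branch taken depends on context."""
    pos, n = 0, len(source)
    while pos < n:
        if source.startswith("<!--", pos):
            end = source.find("-->", pos + 4)
            if end == -1:
                raise SyntaxError("unterminated comment")
            yield ("comment", source[pos + 4:end])
            pos = end + 3
        elif source.startswith("<![CDATA[", pos):
            end = source.find("]]>", pos + 9)
            if end == -1:
                raise SyntaxError("unterminated CDATA section")
            # CDATA content is literal: "<" and "&" are not markup here.
            yield ("cdata", source[pos + 9:end])
            pos = end + 3
        elif source[pos] == "<":
            end = source.find(">", pos)
            if end == -1:
                raise SyntaxError("unterminated tag")
            yield ("tag", source[pos:end + 1])
            pos = end + 1
        else:
            end = source.find("<", pos)
            if end == -1:
                end = n
            yield ("text", decode_entities(source[pos:end]))
            pos = end
```

The order of the checks matters: comments and CDATA sections have to be recognized before the generic `<` branch, because both may legally contain `<` and `>` characters that would otherwise be mistaken for tag boundaries.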

Carl Smotricz
A: 

VTD-XML is probably the simplest parsing technique possible...

vtd-xml-author
Read the question. I'm not looking for a pre-built library; I'm looking for algorithms or tutorials on how to *create my own library*.
Keith Palmer
I think I am referring to the virtual token descriptor, which is what VTD-XML implements.
vtd-xml-author
Spam, again? Don't you learn?
John Saunders
Without using offensive language, the most I can say is: let the person who asked the question be the judge. You are getting excessively annoying (and I can't believe how un-smart you are).
vtd-xml-author
Note that Mr. Zhang is the author of VTD-XML.
John Saunders