ansaurus

Question

XML Parsing Problem

Answer 1

A:

Are there multiple writers? Why isn't your parser validating the XML?

Use a tree in which every node represents an element and carries a dirty bit. The first occurrence of a node marks it as dirty, i.e. you are expecting a closing tag, unless of course the node is of the form <a/>. Also, the first element you encounter is the root.

When you hit a dirty node, keep pushing nodes onto a stack until you hit the closing tag, at which point you pop the contents.
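A minimal sketch of the stack bookkeeping described above, in Python. The function name `find_tag_problems` and the regular expression are assumptions made for illustration; it only tracks tag names, and it does not handle comments, CDATA sections, or `>` characters inside attribute values, so treat it as a rough check rather than a parser.

```python
import re

# Matches opening, closing, and self-closing tags; attributes are skipped, not parsed.
TAG_RE = re.compile(r"<(/?)([A-Za-z_][\w.\-]*)[^<>]*?(/?)>")

def find_tag_problems(text):
    """Return (unclosed, unexpected) lists of tag names found in text."""
    open_tags = []    # stack of "dirty" elements still waiting for a closing tag
    unexpected = []   # closing tags that do not match the innermost open element
    for closing, name, self_closing in TAG_RE.findall(text):
        if self_closing:
            continue  # <a/> never expects a closing tag
        if not closing:
            open_tags.append(name)      # first occurrence: mark as dirty
        elif open_tags and open_tags[-1] == name:
            open_tags.pop()             # matching closing tag: element is complete
        else:
            unexpected.append(name)     # closing tag with no matching opener
    return open_tags, unexpected
```

Anything left in the first list at the end of the input is missing its closing tag; anything in the second list is a closing tag whose opening tag was never seen.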

dirkgently 2009-02-19 08:21:36

Thanks for that dirkgently it worked like a charm

ardsrk 2009-02-20 04:39:37

Answer 2

+1 A:

Because the XML structure is hierarchical (a tree), recursion would be the best way to approach this. You can recurse on each child and fix the missing XML identifiers. Basically, you'll be doing the same thing a DOM parser would do, only you'll parse the file in order to fix its structure. One thing though: it seems to me that with this method you are going to rewrite the XML parser. Isn't it a waste of time? Maybe it's better to find a way for the XML to arrive in the right structure rather than trying to fix it.

Gal Goldman 2009-02-19 09:12:17

Answer 3

A:

In your example, how are you going to figure out exactly where in the content to put the opening <two> tag once you have detected it is missing? This is, as they say, non-trivial.

anon 2009-02-19 10:04:22

Answer 4

+2 A:

What is feeding you the XML from the other end of the socket connection? It doesn't make sense that you should be missing stuff, as you illustrate, just because you receive it from a socket.

If the socket is using TCP (or a custom protocol with similar properties), you should not be missing parts of your XML. Thus, you should be able to just buffer it all until the other end signals "end of document", and then feed it to your picky XML parser.
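A rough sketch of the buffer-then-parse approach in Python. It assumes the sender signals "end of document" by closing its end of the connection; a length prefix or a delimiter would need different loop logic. The function name `read_document` is made up for this example.

```python
import socket
import xml.etree.ElementTree as ET

def read_document(conn, bufsize=4096):
    """Collect bytes until the peer closes the connection, then parse once."""
    chunks = []
    while True:
        data = conn.recv(bufsize)
        if not data:          # empty read: the peer has closed its end (assumed end of document)
            break
        chunks.append(data)
    # The parser only ever sees the complete document, never a partial chunk.
    return ET.fromstring(b"".join(chunks))
```

The parser stays as picky as before; the chunk boundaries imposed by the socket simply never reach it.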

If you are using UDP or some other "lossy" protocol, you need to reconsider, since it's obviously not possible to correctly transfer a large XML document over a channel that randomly drops pieces.

unwind 2009-02-19 10:09:04

Answer 5

+7 A:

Short answer: You're doing it wrong.

Your question confuses two separate issues:

Parsing of data that is not well-formed XML at all, i.e. so-called tag soup.

Example: Files generated by programmers who do not understand XML or have lousy coding practices.
- It is not unfair to say: A file that is not well-formed XML is not an XML document at all. Every correct XML parser will reject it. Ideally you would work to correct the source of this data and make sure that proper XML is generated instead.
- Alternatively, use a tag soup parser, i.e. a parser that does error correction.
  
  Useful tag soup parsers are often actually HTML parsers. tidy has already been pointed out in another answer.
  
  Make certain that you understand what correction steps such a parser actually performs, since there is no universal approach that could fix XML. Tidy in particular is very aggressive at "repairing" the data, more aggressive than real browsers and the HTML 5 spec, for example.
XML parsing from a socket, where data arrives chunk-by-chunk in a stream. In this situation, the XML document might be viewed as "infinite", with chunks being processed as they appear, long before a final end tag for the root element has been seen.

Example: XMPP is a protocol that works like this.
- The solution is to use a pull-based parser, for example the XMLTextReader API in libxml2.
- If a tree-based data structure for the XML child elements being parsed is required, you can build a tree structure for each such element that is being read, just not for the entire document.
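To illustrate the two points above, here is a small sketch using Python's standard-library `xml.etree.ElementTree.XMLPullParser` instead of libxml2's XMLTextReader; the idea is similar, but this is a different API. The generator name `iter_stanzas` and the notion that each direct child of the root is one unit of work are assumptions for the example.

```python
import xml.etree.ElementTree as ET

def iter_stanzas(chunks):
    """Yield each direct child of the root as soon as its end tag has arrived."""
    parser = ET.XMLPullParser(events=("start", "end"))
    depth = 0
    root = None
    for chunk in chunks:              # chunks may split tags at arbitrary points
        parser.feed(chunk)
        for event, elem in parser.read_events():
            if event == "start":
                depth += 1
                if depth == 1:
                    root = elem       # the root's own end tag may never arrive
            else:
                depth -= 1
                if depth == 1:
                    yield elem        # a complete subtree for this child only
                    root.remove(elem) # drop it so the root does not grow forever
```

Each yielded element is a small tree of its own, and the root never accumulates processed children, so memory use stays bounded even if the stream never ends.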

David Lichteblau 2009-02-19 10:17:08

David, I was also thinking of changing the SAX-like parser to a pull-based parser. Maybe in the next release. Thanks.

ardsrk 2009-02-20 04:41:17
