I have an XML parser that crashes on incomplete XML data. The XML fed to it can take either of the following forms:

<one><two>twocontent</two></one>

<a/><b/> (the parser treats this as two root elements)

Element attributes are also handled (though not shown above).

Now, the problem is that when I read data from a socket, I get the data in fragments. For example:

<one>one

content</two>

</one>

Thus, before sending the XML to the parser, I have to construct valid XML from the fragments. What programming construct (iteration, recursion, etc.) would be the best fit for this kind of scenario?

I am programming in C++.

Please help.

A: 

Are there multiple writers? Why isn't your parser validating the XML?

Use a tree in which every node represents an element and carries a dirty bit. The first occurrence of a node marks it as dirty, i.e. you are expecting a closing tag, unless of course the node is of the form <a/>. Also, the first element you encounter is the root.

When you hit a dirty node, keep pushing nodes onto a stack until you hit the closing tag, at which point you pop the contents.
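
The stack idea above can be sketched roughly as follows. This is only an illustration, not a real XML tokenizer: the names `Status` and `checkNesting` are made up for this example, and the code assumes there are no comments, CDATA sections, processing instructions, or '>' characters inside attribute values.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type for this sketch.
enum class Status { Complete, Incomplete, Malformed };

// Scans `xml` and tracks open ("dirty") elements on a stack.
// Complete:   every start tag seen has been closed.
// Incomplete: a tag is cut off, or some elements are still open.
// Malformed:  a closing tag does not match the innermost open element.
Status checkNesting(const std::string& xml) {
    std::vector<std::string> open;  // names of elements awaiting a closing tag
    std::size_t pos = 0;
    while ((pos = xml.find('<', pos)) != std::string::npos) {
        const std::size_t end = xml.find('>', pos);
        if (end == std::string::npos) return Status::Incomplete;  // tag cut off
        const std::string tag = xml.substr(pos + 1, end - pos - 1);
        pos = end + 1;
        if (tag.empty()) return Status::Malformed;  // "<>"
        if (tag[0] == '/') {
            // Closing tag: must match the top of the stack, then pop it.
            if (open.empty() || open.back() != tag.substr(1)) {
                return Status::Malformed;
            }
            open.pop_back();
        } else if (tag.back() != '/') {
            // Start tag (not the <a/> form): push the element name,
            // dropping any attributes after the first whitespace.
            open.push_back(tag.substr(0, tag.find_first_of(" \t\r\n")));
        }
    }
    return open.empty() ? Status::Complete : Status::Incomplete;
}
```

With this, a fragment such as `<one>one` reports Incomplete, so the caller can keep appending socket data and only hand the buffer to the real parser once the result is Complete.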

dirkgently
Thanks for that, dirkgently. It worked like a charm.
ardsrk
+1  A: 

Because XML is a hierarchical structure (a tree), recursion would be the best way to approach this. You can recurse into each child and fix the missing XML identifiers. Basically, you'll be doing the same thing a DOM parser would do, only you'll parse the file in order to fix its structure. One thing, though: it seems to me that with this method you are going to rewrite the XML parser. Isn't that a waste of time? Maybe it's better to find a way for the XML to arrive in the right structure rather than trying to fix it.

Gal Goldman
A: 

In your example, how are you going to figure out exactly where in the content to put the opening <two> tag once you have detected it is missing? This is, as they say, non-trivial.

anon
+2  A: 

What is feeding you the XML from the other end of the socket connection? It doesn't make sense that you should be missing stuff, as you illustrate, just because you receive it from a socket.

If the socket is using TCP (or a custom protocol with similar properties), you should not be missing parts of your XML. Thus, you should be able to just buffer it all until the other end signals "end of document", and then feed it to your picky XML parser.
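
How the other end signals "end of document" depends on the protocol, which the question does not describe. As a minimal sketch, assume a hypothetical framing in which each document is preceded by its byte length in decimal and a newline. The class name `DocumentBuffer` is invented for this example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Assumed (hypothetical) wire format: "<length>\n<length bytes of XML>",
// repeated. Chunks may split a frame at any position.
class DocumentBuffer {
public:
    // Appends one chunk as read from the socket and returns every document
    // that is now complete, in arrival order.
    std::vector<std::string> feed(const std::string& chunk) {
        buffer_ += chunk;
        std::vector<std::string> docs;
        for (;;) {
            const std::size_t nl = buffer_.find('\n');
            if (nl == std::string::npos) break;         // header not complete yet
            // std::stoul throws if the header is not a number; a real
            // implementation would treat that as a protocol error.
            const std::size_t len = std::stoul(buffer_.substr(0, nl));
            if (buffer_.size() - nl - 1 < len) break;   // body not complete yet
            docs.push_back(buffer_.substr(nl + 1, len));
            buffer_.erase(0, nl + 1 + len);             // drop the consumed frame
        }
        return docs;
    }

private:
    std::string buffer_;  // bytes received but not yet returned
};
```

Each string returned by `feed` is a whole document and can be passed to the picky parser unchanged; partial data simply stays in the buffer until the next read.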

If you are using UDP or some other "lossy" protocol, you need to reconsider, since it's obviously not possible to correctly transfer a large XML document over a channel that randomly drops pieces.

unwind
+7  A: 

Short answer: You're doing it wrong.

Your question confuses two separate issues:

  1. Parsing of data that is not well-formed XML at all, i.e. so-called tag soup.

    Example: Files generated by programmers who do not understand XML or have lousy coding practices.

    • It is not unfair to say: A file that is not well-formed XML is not an XML document at all. Every correct XML parser will reject it. Ideally you would work to correct the source of this data and make sure that proper XML is generated instead.

    • Alternatively, use a tag soup parser, i.e. a parser that does error correction.

      Useful tag soup parsers are often actually HTML parsers. tidy has already been pointed out in another answer.

      Make certain that you understand what correction steps such a parser actually performs, since there is no universal approach that could fix XML. Tidy in particular is very aggressive at "repairing" the data, more aggressive than real browsers and the HTML 5 spec, for example.

  2. XML parsing from a socket, where data arrives chunk-by-chunk in a stream. In this situation, the XML document might be viewed as "infinite", with chunks being processed as they appear, long before a final end tag for the root element has been seen.

    Example: XMPP is a protocol that works like this.

    • The solution is to use a pull-based parser, for example the xmlTextReader API in libxml2.

    • If a tree-based data structure for the XML child elements being parsed is required, you can build a tree structure for each such element as it is read, just not for the entire document.
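
To illustrate the per-element idea in point 2 without depending on libxml2, here is a hand-rolled stand-in in standard C++. It is not the xmlTextReader API, and the class name `StanzaSplitter` is made up for this sketch. It hands out each direct child of the root element as soon as that child is complete, while the root itself stays open. It assumes no comments, CDATA sections, or '>' inside attribute values.

```cpp
#include <cstddef>
#include <string>
#include <vector>

class StanzaSplitter {
public:
    // Feeds one chunk; returns the direct children of the root element that
    // became complete with this chunk, each as its own XML string.
    std::vector<std::string> feed(const std::string& chunk) {
        buffer_ += chunk;
        std::vector<std::string> stanzas;
        std::size_t lt;
        while ((lt = buffer_.find('<', scan_)) != std::string::npos) {
            const std::size_t gt = buffer_.find('>', lt);
            if (gt == std::string::npos) break;  // tag split across chunks: wait
            const bool closing = buffer_[lt + 1] == '/';
            const bool selfClosing = !closing && buffer_[gt - 1] == '/';
            if (!closing && depth_ == 1) start_ = lt;  // a child of root begins
            if (closing) {
                if (depth_ > 0) --depth_;
            } else if (!selfClosing) {
                ++depth_;
            }
            scan_ = gt + 1;
            // Back at depth 1 after a closing or <a/>-style tag: child is done.
            if (depth_ == 1 && (closing || selfClosing)) {
                stanzas.push_back(buffer_.substr(start_, gt + 1 - start_));
            }
            if (depth_ <= 1) {  // nothing before scan_ is needed any more
                buffer_.erase(0, scan_);
                scan_ = 0;
            }
        }
        return stanzas;
    }

private:
    std::string buffer_;     // unconsumed input
    std::size_t scan_ = 0;   // where to resume looking for the next tag
    std::size_t start_ = 0;  // offset of the current child's start tag
    int depth_ = 0;          // number of currently open elements
};
```

Each returned string is a small, complete element that can be given to a tree-building parser on its own, so memory use stays bounded even if the root element is never closed.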

David Lichteblau
David, I was also thinking of changing the SAX-like parser to a pull-based parser. Maybe in the next release. Thanks.
ardsrk