ansaurus

Question

Configure Xerces SAX parser to tolerate an XML syntax error

Answer 1

+3 A:

I don't think you will find any XML parser that tolerates this sort of error. The only thing I can suggest is that you pre-process the XML to correct the errors before handing it to the parser.

Dunderklumpen 2010-07-23 04:41:08
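
To illustrate the point above, here is a minimal sketch of what happens when the kind of input discussed in this thread (an unescaped < inside an attribute value) reaches a SAX parser through the standard JAXP API. The class name, the method name and the `label` attribute are made up for the example, and it uses whichever SAX parser JAXP selects by default rather than a specifically configured Xerces instance.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of Xerces or JAXP.
class StrictParseDemo {

    /**
     * Parses the given document and returns null if it is well-formed,
     * otherwise the message of the parser's fatal error.
     */
    static String wellFormednessError(String xml) {
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            // DefaultHandler rethrows fatal errors, so a broken document ends up here.
            return e.getMessage();
        } catch (Exception e) {
            // Configuration or I/O problems are not expected for in-memory input.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample input: the attribute name "label" is invented.
        System.out.println(wellFormednessError("<item label=\"a < b\"/>"));
        System.out.println(wellFormednessError("<item label=\"a &lt; b\"/>"));
    }
}
```

The first call reports a parse error because the raw < is not allowed in an attribute value. The second call returns null because the same value has been escaped.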

The funny thing about pre-processing, as applied to the example presented, is that the pre-processor has to understand the various XML contexts, such as where a new node starts. In other words, it needs some basic XML parsing logic of its own to recognize where escaping is required and apply it. That looks a lot like writing the "tolerant" XML parser the OP wants.

Vineet Reynolds 2010-07-23 10:46:39

@Vineet. The problem is that it cannot be done in a general way. The preprocessing needs to be based on @finnw's knowledge of what the vendor XML ought to look like, a systematization of the observed mistakes, and then pattern-based matching and correction. Without @finnw's knowledge, there could be many possible corrections and no way for a hypothetical error-tolerant parser to pick the right one.

Stephen C 2010-07-23 12:15:02

@Stephen, that's precisely why pre-processing cannot be a blanket solution. One can use regexes and other schemes to discover these errors, but if they're random and could occur in any attribute value or text node, then what?

Vineet Reynolds 2010-07-23 12:21:15

@Vineet - obviously. Truly random errors are hard to deal with, and sufficiently broken XML cannot be fixed by any means. But an (ideal) human-directed corrector will do better than an (ideal) corrector that takes no human direction, because in practice errors are likely to be (somewhat) systematic rather than truly random. For instance, the error in the example is most likely caused by some broken software that doesn't escape special characters in the values of certain attributes. Knowing this, you can design your regexes to match those attributes and correct their values.

Stephen C 2010-07-23 12:41:02
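
The targeted correction described in the last comment could be sketched roughly as follows. This is only an illustration under stated assumptions: the attribute name `label` and the class `AttributeEscaper` are hypothetical, the values are assumed to be double-quoted and to contain no literal double quote, and only the < character is repaired. A plain regex has no notion of XML context, so it would also rewrite a matching pattern inside text content or a comment.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-processor, run on the raw vendor text before SAX parsing.
class AttributeEscaper {

    // Assumption: "label" is the one attribute the vendor software fails to escape.
    // Groups: (1) name, equals sign and opening quote, (2) raw value, (3) closing quote.
    private static final Pattern BROKEN_ATTR =
            Pattern.compile("(\\blabel\\s*=\\s*\")([^\"]*)(\")");

    /** Escapes every raw '<' found inside the values of the known-bad attribute. */
    static String escapeKnownAttributes(String rawXml) {
        Matcher m = BROKEN_ATTR.matcher(rawXml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String fixedValue = m.group(2).replace("<", "&lt;");
            // quoteReplacement keeps '$' and '\' in the value from being
            // interpreted as replacement syntax.
            m.appendReplacement(out,
                    Matcher.quoteReplacement(m.group(1) + fixedValue + m.group(3)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeKnownAttributes("<item label=\"a < b\" other=\"x\"/>"));
    }
}
```

Because an already escaped value contains no raw <, running the correction twice gives the same result as running it once, which makes it safe to apply to a feed where only some documents are broken.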

Answer 2

+4 A:

"... and for political reasons it might not be possible to persuade the supplier to fix the faulty program ..."

For political reasons you ought to try your damnedest to get them to fix it. Wave the requirements specification in front of them that says that the input must be well-formed XML. Threaten to bill them for the cost of developing a bespoke parser. (OK, that probably won't work ...)

By giving up without a fight, you are just leaving the problem to trouble other people who have to deal with this supplier in the future.

Stephen C 2010-07-23 04:57:14

Answer 3

+1 A:

I found this article through a Google search:

http://www.cs.sfu.ca/~cameron/REX.html

It suggests a strategy for error correction and also explains why this particular error (unescaped < within an attribute value) is more serious than it looks.

finnw 2010-07-23 16:52:57
