I am getting this error when parsing an incorrectly-generated XML document:

org.xml.sax.SAXParseException: The value of attribute "bar" associated with an element type "foo" must not contain the '<' character.

I know what is causing the problem. It is this line:

<foo bar="x<y">42</foo>

It should have been

<foo bar="x&lt;y">42</foo>

I am aware that this is not well-formed XML, but my code has to download and parse similar files unattended. For political reasons it might not be possible to persuade the supplier to fix the faulty program, especially when other programs are reading the file and tolerating this error.

Is there any way to configure Xerces to tolerate it? At present it treats it as a fatal error. Implementing an ErrorHandler to ignore it is not satisfactory because then the remainder of the document is not parsed.

Alternatively can you suggest another stream-based parser that can be configured to tolerate this error? Using a DOM parser is not feasible as these documents run into hundreds of megabytes.

+3  A: 

I don't think you will find any XML parsers that will tolerate this sort of error. The only thing I can suggest is that you pre-process the XML to remove errors that might occur.

Dunderklumpen
The funny thing about pre-processing, as applied to the example presented, is that you need to understand the various XML contexts, such as the start of a new node. In other words, the pre-processor itself needs some basic XML parsing logic that can recognize where escaping is required in a given context and apply it. That looks a lot like writing the "tolerant" XML parser the OP is asking for.
Vineet Reynolds
@Vineet. The problem is that it cannot be done in a general way. The preprocessing needs to be done based on @finnw's knowledge of what the vendor XML ought to look like, a systematization of the observed mistakes and then pattern-based matching and correction. Without using @finnw's knowledge, there could be many possible corrections and no way for a hypothetical error tolerant parser to pick the right one.
Stephen C
@Stephen, that's precisely why pre-processing cannot be a blanket solution. One can use regexes and other schemes to discover these errors, but if they're random and could occur in any attribute value or text node, then what?
Vineet Reynolds
@Vineet - obviously. Truly random errors are hard to deal with, and sufficiently broken XML cannot be fixed by any means. But an (ideal) human-directed corrector will do better than an (ideal) corrector that takes no human direction, because in practice errors are likely to be (somewhat) systematic rather than truly random. For instance, the error in the example is most likely caused by some broken software that doesn't escape special characters in the values of certain attributes. Knowing this, you can design your regexes to match those attributes and correct their values.
Stephen C
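To make the regex idea in the comment above concrete, here is a minimal sketch of such a pre-processor in Java. It is not a general solution. It assumes the only systematic error is an unescaped '<' inside the value of the bar attribute from the question, that attribute values are double-quoted and contain no '"', and that an attribute never spans a line break. The class and method names (AttributeEscaper, fixLine, fix) are hypothetical. It works line by line, so it never holds a whole multi-hundred-megabyte document in memory.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: escapes '<' inside the value of the "bar" attribute only.
final class AttributeEscaper {

    // Assumption: the attribute is written as  bar="..."  on a single line,
    // and its value never contains a double quote.
    private static final Pattern BAR_ATTR =
            Pattern.compile("(\\sbar=\")([^\"]*)(\")");

    // Returns the line with every '<' inside a bar="..." value replaced by &lt;.
    static String fixLine(String line) {
        Matcher m = BAR_ATTR.matcher(line);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = m.group(2).replace("<", "&lt;");
            // quoteReplacement guards against '$' or '\' in the attribute value.
            m.appendReplacement(out,
                    Matcher.quoteReplacement(m.group(1) + value + m.group(3)));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Copies in to out one line at a time, correcting each line on the way.
    static void fix(Reader in, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        BufferedWriter writer = new BufferedWriter(out);
        String line;
        while ((line = reader.readLine()) != null) {
            writer.write(fixLine(line));
            writer.newLine();
        }
        writer.flush();
    }
}
```

The corrected output could then be handed to the SAX parser, for example by writing it to a temporary file first. Values that are already escaped pass through unchanged, so the filter is harmless if the supplier ever fixes the bug.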
+4  A: 

... and for political reasons it might not be possible to persuade the supplier to fix the faulty program ...

For political reasons you ought to try your damnedest to get them to fix it. Wave the requirements specification in front of them that says that the input must be well-formed XML. Threaten to bill them for the cost of developing a bespoke parser. (OK, that probably won't work ...)

By giving up without a fight, you are just leaving the problem to trouble other people who have to deal with this supplier in the future.

Stephen C
+1  A: 

I found this article through a Google search:

http://www.cs.sfu.ca/~cameron/REX.html

It suggests a strategy for error correction and also explains why this particular error (unescaped < within an attribute value) is more serious than it looks.

finnw