Most XML parsers will give up after the first error in a document. In fact, IIRC, that's actually part of the 'official' spec for parsers.

I'm looking for something that will break that rule. It should take a given schema (assume the schema itself is valid) and an XML input, and attempt to keep going after the first error. It should either raise an event for each error or return a list when finished, so I can use it to generate some kind of report of the errors in the document. This requirement comes from above, so let's try to keep the purist "but it wouldn't make sense to keep going" comments to a minimum.

I'm looking for something that will evaluate both whether the document is well-formed and whether it conforms to the schema. Ideally it would report those as different classes of error. I'd prefer a .NET solution, but I could use a standalone .exe as well. If you know of one that uses a different platform, go ahead and post it, because someone else might find it useful.

Update: I expect that most of the documents where I use this will be mostly well-formed. Maybe an & included as data instead of &amp; here and there, or an occasional misplaced tag. I don't expect the parser to be able to recover from everything, just to make a best effort to keep going. If a document is too out of whack, it should spit out as much as it can, followed by some kind of 'fatal, unable to continue' error. Otherwise the schema validation part is pretty easy.

+1  A: 

"In fact, IIRC, that's actually part of the 'official' spec for parsers."

Official does not need to be quoted :)

From the XML 1.0 specification, on "fatal error":

[Definition:] An error which a conforming XML processor must detect and report to the application. After encountering a fatal error, the processor may continue processing the data to search for further errors and may report such errors to the application. In order to support correction of errors, the processor may make unprocessed data from the document (with intermingled character data and markup) available to the application. Once a fatal error is detected, however, the processor must not continue normal processing (i.e., it must not continue to pass character data and information about the document's logical structure to the application in the normal way).

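So a conforming parser is allowed to keep looking for further errors; it just must not carry on with normal processing. You could use xmllint with its --recover option.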

Damien B
A: 

Sounds like you might want TagSoup. It may not be exactly what you want, but as far as bad-document-handling parsers go it's the gold standard.

Craig Walker
+1  A: 

Xerces has a feature you can turn on to make it try to continue after a fatal error:

http://apache.org/xml/features/continue-after-fatal-error
True: Attempt to continue parsing after a fatal error.
False: Stops parse on first fatal error.
Default: false
Note: The behavior of the parser when this feature is set to true is undetermined! Therefore use this feature with extreme caution because the parser may get stuck in an infinite loop or worse.
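
For the "list of errors, split into two classes" part of the question, here is a rough sketch using only the JAXP classes that ship with the JDK. The names LenientValidator, Report and check are made up for illustration. It relies on the fact that schema violations are reported through error() (the parser keeps going if the handler does not throw), while well-formedness problems arrive through fatalError(). Whether the feature URI above is recognized depends on the parser implementation in use; I am assuming a Xerces-derived parser, and setFeature will throw if it is not supported.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper; class and method names are illustrative only.
class LenientValidator {

    // Feature URI quoted in the answer above (Xerces-specific, may be unsupported elsewhere).
    static final String CONTINUE_AFTER_FATAL =
        "http://apache.org/xml/features/continue-after-fatal-error";

    /** Collects problems in two buckets instead of aborting on the first one. */
    static class Report extends DefaultHandler {
        final List<String> validity = new ArrayList<>();       // schema violations
        final List<String> wellFormedness = new ArrayList<>(); // fatal (syntax) errors

        @Override public void error(SAXParseException e) { validity.add(describe(e)); }
        @Override public void fatalError(SAXParseException e) { wellFormedness.add(describe(e)); }

        private static String describe(SAXParseException e) {
            return "line " + e.getLineNumber() + ", col " + e.getColumnNumber()
                + ": " + e.getMessage();
        }
    }

    static Report check(String xsd, String xml, boolean continueAfterFatal) throws Exception {
        Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
            .newSchema(new StreamSource(new StringReader(xsd)));

        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // required for XSD validation
        factory.setSchema(schema);
        SAXParser parser = factory.newSAXParser();
        if (continueAfterFatal) {
            // Use with caution: behavior after a fatal error is undetermined.
            parser.getXMLReader().setFeature(CONTINUE_AFTER_FATAL, true);
        }

        Report report = new Report();
        try {
            parser.parse(new InputSource(new StringReader(xml)), report);
        } catch (SAXParseException stopped) {
            // The parser gave up; fatalError() has already recorded the problem.
        }
        return report;
    }

    public static void main(String[] args) throws Exception {
        String xsd =
            "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
          + "<xs:element name='root'><xs:complexType><xs:sequence>"
          + "<xs:element name='item' type='xs:int' maxOccurs='unbounded'/>"
          + "</xs:sequence></xs:complexType></xs:element></xs:schema>";
        Report r = check(xsd, "<root><item>a</item><item>b</item></root>", false);
        r.validity.forEach(m -> System.out.println("INVALID: " + m));
        r.wellFormedness.forEach(m -> System.out.println("MALFORMED: " + m));
    }
}
```

With continueAfterFatal left false you get every schema violation up to the first well-formedness error, then a single 'fatal' entry, which is close to the "spit out as much as it can" behavior asked for. Passing true is the risky path the note above warns about.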

jelovirt