I need to parse XML whose creation I have no control over. Unfortunately it is not well-formed and contains things like:

<mytag>This won't parse & contains an ampersand.</mytag>

The javax.xml.stream classes don't like this at all, and rightly error with:

javax.xml.stream.XMLStreamException: ParseError at [row,col]:[149,50]
Message: The entity name must immediately follow the '&' in the entity reference.

How can I work around this? I can't change the XML, so I guess I need an error-tolerant parser.

My preference would be for a fix that doesn't require too much disruption to the existing parser code.

+3  A: 

If it's not well-formed XML (like the above), then no conforming XML parser will handle it (as you've identified). If you know the scope of the errors (such as the unescaped ampersand above), then the simplest solution may be to run a correcting pass over it (for example, replacing bare ampersands with the &amp; entity) and then feed the result to an existing parser.

Otherwise you'll have to code one yourself with built-in support for such anomalies. And I can't believe that's anything other than a tedious and error-prone task.
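As a rough illustration of the correcting pass described above, here is a minimal sketch. It assumes the only defect is the bare ampersand and that the document fits in memory as a String. The class name AmpersandFixer and both method names are made up for this example. The regex leaves alone anything that already looks like a named, decimal, or hex reference, so it will not help with undeclared entities such as &nbsp;, and it would wrongly rewrite ampersands inside CDATA sections.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of any library.
final class AmpersandFixer {

    // An '&' that is NOT the start of a named (&amp;), decimal (&#38;)
    // or hex (&#x26;) reference.
    private static final Pattern BARE_AMP = Pattern.compile(
            "&(?!(?:[A-Za-z_][A-Za-z0-9_.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    // Assumption: bare ampersands are the only problem in the input.
    static String escapeBareAmpersands(String xml) {
        return BARE_AMP.matcher(xml).replaceAll("&amp;");
    }

    // Runs the corrected text through the usual StAX reader and returns
    // the concatenated character data, to show the existing parser code
    // can stay as it is.
    static String textOf(String brokenXml) {
        String fixed = escapeBareAmpersands(brokenXml);
        StringBuilder text = new StringBuilder();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new StringReader(fixed));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    text.append(reader.getText());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            // The input has problems beyond bare ampersands.
            throw new IllegalArgumentException("Still not well-formed", e);
        }
        return text.toString();
    }
}
```

Only the input handed to the parser changes, which keeps disruption to the existing javax.xml.stream code small. For large inputs the same replacement could be applied in a filtering Reader instead of on a whole String.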

Brian Agnew