I'm using a third-party library that returns "XML" that is not valid, because it contains invalid characters, as well as non-declared entities. I need to use a Java XML parser to parse this XML, but it's choking.

Is there a generic way to sanitize this XML so that it becomes valid?

+1  A: 

Try http://jtidy.sourceforge.net/.

Tom Eyckmans
+4  A: 

I think your options are something like:

The first two are more heavyweight, since they're designed to parse ill-formed HTML. If you know the problems are limited to encoding and entities, and the document is otherwise well-formed, I'd suggest you roll your own:

  • standardize an encoding to UTF-8
  • use a standard encoder for the text between the > and < characters (text entities).
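
A minimal sketch of the two steps above, under stated assumptions: the source charset is known to the caller, and "encoding the text" is reduced to escaping bare ampersands in the runs between a > and the next <. The class name TextNodeEscaper and its methods are hypothetical, not part of any library. Undeclared entities such as &nbsp; end up as literal text (&amp;nbsp;), and an XML declaration that names a different encoding would still need to be adjusted by hand.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: re-encodes to UTF-8 and escapes stray '&' in text content. */
final class TextNodeEscaper {
    // A run of text that sits between a '>' and the next '<'.
    private static final Pattern TEXT = Pattern.compile("(?<=>)[^<>]+(?=<)");

    // An '&' that does not start a predefined entity or a numeric character reference.
    private static final Pattern BARE_AMP =
            Pattern.compile("&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)");

    /** Escapes bare ampersands in text runs; tags and attributes are left untouched. */
    static String escapeText(String xml) {
        Matcher m = TEXT.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String escaped = BARE_AMP.matcher(m.group()).replaceAll("&amp;");
            m.appendReplacement(out, Matcher.quoteReplacement(escaped));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Decodes with the (assumed known) source charset, escapes, and re-encodes as UTF-8. */
    static byte[] toCleanUtf8(byte[] raw, Charset sourceCharset) {
        String xml = new String(raw, sourceCharset);
        return escapeText(xml).getBytes(StandardCharsets.UTF_8);
    }
}
```

For example, TextNodeEscaper.escapeText("<a>Tom & Jerry</a>") yields "<a>Tom &amp; Jerry</a>". A regex pass like this does not understand comments, CDATA sections, or attribute values, so it only suits input that is otherwise well-formed.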
jamesh
+1  A: 

Sounds like you need to figure out whether there's a way to automatically clean the data yourself before handing it off to a parser. In what way are the characters invalid: are they not valid in the declared character set, or are they unescaped XML metacharacters such as '<'?

For undeclared entities, I once solved this by configuring a SAX parser with an error handler that basically ignored these errors. That might help you too. See the ErrorHandler API.
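
A rough sketch of that idea using the standard JAXP SAX API. The class name LenientSaxParse is hypothetical. One caveat worth stating: an undeclared entity is usually reported through fatalError, and many parsers stop after that callback returns. The feature URI below is Xerces-specific; whether the parser in use recognizes it is an assumption, so the code tolerates it being rejected and records the point where parsing gave up.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: records parse problems instead of throwing. */
final class LenientSaxParse extends DefaultHandler {
    // Xerces-specific feature; other parsers may not recognize it (assumption).
    private static final String CONTINUE_AFTER_FATAL =
            "http://apache.org/xml/features/continue-after-fatal-error";

    final List<String> problems = new ArrayList<>();
    final StringBuilder text = new StringBuilder();

    @Override public void warning(SAXParseException e) { problems.add("warning: " + e.getMessage()); }
    @Override public void error(SAXParseException e) { problems.add("error: " + e.getMessage()); }
    @Override public void fatalError(SAXParseException e) { problems.add("fatal: " + e.getMessage()); }

    @Override public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    static LenientSaxParse parse(String xml) {
        LenientSaxParse handler = new LenientSaxParse();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        try {
            factory.setFeature(CONTINUE_AFTER_FATAL, true);
        } catch (ParserConfigurationException | SAXException e) {
            // Feature not available: the parser will stop at the first fatal error.
        }
        try {
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.setErrorHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException | RuntimeException e) {
            // After a fatal error the parser state is not guaranteed, so treat any failure as "gave up".
            handler.problems.add("parser gave up: " + e.getMessage());
        }
        return handler;
    }
}
```

Calling LenientSaxParse.parse("<root>a&nbsp;b</root>") should leave at least one entry in problems; how much of the text is collected after the bad reference depends on the parser.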

Dov Wasserman
A: 

For illegal characters, I would recommend implementing a filtering Reader that either replaces them with spaces (assuming they are control characters) or strips them out.
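
A minimal sketch of such a filtering Reader, built on java.io.FilterReader. The class name XmlCharFilterReader and the clean helper are hypothetical. It replaces characters outside the XML 1.0 Char range with a space; surrogate code units are passed through unchanged on the assumption that they are correctly paired.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical Reader that replaces characters not allowed in XML 1.0 with a space. */
final class XmlCharFilterReader extends FilterReader {
    XmlCharFilterReader(Reader in) {
        super(in);
    }

    /** XML 1.0 Char production, checked per UTF-16 code unit. */
    static boolean isAllowed(char c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xD800 && c <= 0xDFFF) // surrogates: assumed to be valid pairs
                || (c >= 0xE000 && c <= 0xFFFD);
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        return (c == -1 || isAllowed((char) c)) ? c : ' ';
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        // When n is -1 (end of stream) the loop body never runs.
        for (int i = off; i < off + n; i++) {
            if (!isAllowed(buf[i])) {
                buf[i] = ' ';
            }
        }
        return n;
    }

    /** Convenience for in-memory strings. */
    static String clean(String s) {
        StringBuilder out = new StringBuilder(s.length());
        try (Reader r = new XmlCharFilterReader(new StringReader(s))) {
            int c;
            while ((c = r.read()) != -1) {
                out.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

In practice the reader would wrap whatever Reader the library output comes from, and the wrapped reader is then handed to the parser, for example through an org.xml.sax.InputSource.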

Undeclared entities are trickier; some XML parsers let you specify an alternative DTD to use (Woodstox does, at least). If so, you could inject a DTD that declares the entities you need.
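
One parser-independent way to approximate this, sketched with plain JAXP rather than the Woodstox-specific DTD override: prepend a DOCTYPE whose internal subset declares the missing entities. The class name EntityDeclaringParser is hypothetical, and the sketch assumes the input has no XML declaration and no DOCTYPE of its own, and that the entity values contain no quote, percent, or bare ampersand characters.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical helper: declares missing entities via an injected internal DTD subset. */
final class EntityDeclaringParser {

    /** Builds "<!DOCTYPE root [<!ENTITY name "value">...]>" and puts it in front of the XML. */
    static String withInternalSubset(String xml, String rootName, Map<String, String> entities) {
        StringBuilder doctype = new StringBuilder("<!DOCTYPE ").append(rootName).append(" [");
        for (Map.Entry<String, String> e : entities.entrySet()) {
            doctype.append("<!ENTITY ").append(e.getKey())
                   .append(" \"").append(e.getValue()).append("\">");
        }
        return doctype.append("]>").append(xml).toString();
    }

    /** Parses the XML with the injected declarations using a non-validating DOM parser. */
    static Document parse(String xml, String rootName, Map<String, String> entities) {
        String patched = withInternalSubset(xml, rootName, entities);
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(patched)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse patched XML", e);
        }
    }
}
```

For instance, parse("<root>a&nbsp;b</root>", "root", Collections.singletonMap("nbsp", "&#160;")) resolves the reference to a no-break space. If the input already carries an XML declaration or DOCTYPE, the injection point would have to move after the declaration or merge with the existing subset.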

StaxMan