We have various XML files produced by an application that is already in distribution. Some of these files turn out to contain invalid characters, which makes them invalid XML: in most cases they won't load unless all validation is turned off, and even then only into an XmlDocument, not an XDocument.

As this app is already out there, we have to cope with the files it produces. I could keep adding to a Sanitizer type that knows what to look for and how to fix it before trying to load the document, but I was hoping that someone has already put in the effort to build something that does this efficiently (such as a SanitizedXmlReader class).

This question touches on the same topic, but I didn't find a satisfactory answer there. All we want is to remove content that is invalid anywhere in an XML file (as opposed to data that is valid only in, say, CDATA, or only when not used in a QName).

So, does such a thing exist that can take an "almost" XML file and turn it into an "at least there are no invalid characters" XML file? If not, rolling our own is the next option. In that case, rather than spending time interpreting the XML specification to work out which characters are illegal in all situations, is there a definitive list somewhere?

+2  A: 

I think this link might help on this issue - http://prettycode.org/2009/05/07/hexadecimal-value-0x-is-an-invalid-character/

adatapost
A great resource! Thanks for this. It may be worth quoting pertinent portions in your answer here, if you have the time. Thanks again.
Jeff Yates
+1  A: 

I used SGMLReader a few years ago to load crappy HTML code. That may help you too to parse invalid XML.

Thomas Freudenberg
Thanks! I had forgotten about SGMLReader.
Jeff Yates
Have you tried any of the answers? Asking because I need to read crappy 3rd-party XML myself in the near future.
Thomas Freudenberg
+1  A: 

Problems

If you do end up writing your own, knowing which characters are valid is definitely a little tricky.

XML 1.1 changed the rules, but let's assume that nobody uses it ('cause hardly anyone does), and stick to 1.0.

XML 1.0 Fifth Edition also changed the rules relative to earlier editions, but not in any way you can tell from the document itself. It simplified some things with regard to Unicode, but against the recommendations of some of the original spec authors. Let's also pretend this issue doesn't exist.

Answer

Java has this nice little class, XmlChar, which has methods that you can use to determine which characters are valid for which constructs. .Net doesn't, but the Mono project includes the source to a System.Xml.XmlChar which might help you out.

You could probably start by filtering out all characters which are definitely not allowed anywhere. The XmlChar.IsValid(char c) method from the above Mono class should help.
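As a rough sketch of that first pass (written in Python for brevity, and not taken from the Mono class): the XML 1.0 specification's Char production allows only #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF, so anything outside those ranges is illegal everywhere in a document. The function names below are made up for illustration, and the code assumes the file has already been decoded to text.

```python
def is_xml10_char(cp: int) -> bool:
    """Return True if code point `cp` matches the XML 1.0 `Char` production."""
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # most of the BMP, below the surrogates
        or 0xE000 <= cp <= 0xFFFD      # rest of the BMP, minus U+FFFE/U+FFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )


def strip_invalid_xml_chars(text: str) -> str:
    """Drop every character that is not allowed anywhere in XML 1.0."""
    return "".join(ch for ch in text if is_xml10_char(ord(ch)))


if __name__ == "__main__":
    dirty = "<a>ok\x00\x0b</a>"
    print(repr(strip_invalid_xml_chars(dirty)))
```

This only removes literal characters. Numeric character references to the same code points (for example `&#x0;`) are also rejected by a conforming parser and would need a separate pass. In .NET, where a char is a UTF-16 code unit, surrogate pairs would also have to be combined before the range check.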

It would be interesting to know what other types of bad XML that application produces.

lavinio
Thanks. The "definitely not allowed anywhere" characters are the ones I really want to tackle. The others are a minor irritant that can be dealt with later.
Jeff Yates