We have various XML files produced by an application that is currently in distribution. Some of these files turn out to contain invalid characters, which makes them invalid XML. They won't load in most cases unless all validation is turned off, and even then only into `XmlDocument` instances, not `XDocument`.
As this app is already out there, we have to cope with the files it produces. I could keep adding to a `Sanitizer` type that knows what to look for and how to fix it before trying to load the document, but I was hoping someone had already put in the effort to produce something that does this efficiently (such as a `SanitizedXmlReader` class).
This question touches on the same topic, but I didn't find a satisfactory answer there. All we want is to remove content that is invalid anywhere in an XML file (as opposed to data that is valid only in, say, CDATA, or only when not used in a QName).
So, does such a thing exist that can take an "almost" XML file and turn it into an "at least there are no invalid characters" XML file? If not, rolling our own is the next option. In that case, rather than spending time interpreting the XML specification to work out which characters are illegal in all situations, is there a definitive list somewhere?
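For concreteness, this is roughly what rolling our own would look like. It is only a sketch: `XmlSanitizer`, `IsLegalXmlCodePoint` and `StripInvalidXmlChars` are hypothetical names, and it assumes that the `Char` production in the XML 1.0 specification is the right list of characters that are legal anywhere in a document, which is exactly the assumption I would like confirmed.

```csharp
using System;
using System.Text;

public static class XmlSanitizer
{
    // Assumed rule, taken from the XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    public static bool IsLegalXmlCodePoint(int codePoint)
    {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    // Returns a copy of the input with every character outside the
    // assumed legal ranges removed.
    public static string StripInvalidXmlChars(string input)
    {
        if (input == null) throw new ArgumentNullException("input");

        var result = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];

            // A well-formed surrogate pair encodes a code point in
            // [#x10000-#x10FFFF], which is inside the legal ranges.
            if (char.IsHighSurrogate(c)
                && i + 1 < input.Length
                && char.IsLowSurrogate(input[i + 1]))
            {
                result.Append(c).Append(input[i + 1]);
                i++; // skip the low surrogate we just copied
            }
            else if (IsLegalXmlCodePoint(c))
            {
                result.Append(c);
            }
            // Anything else (most C0 control characters, lone
            // surrogates, U+FFFE, U+FFFF) is dropped.
        }
        return result.ToString();
    }
}
```

The idea would be to run the decoded file text through `StripInvalidXmlChars` before handing it to `XDocument.Parse`. Note that this only deals with raw characters; numeric character references to illegal code points (for example `&#x0;`) would pass through untouched and would need separate handling.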