Given a Stream as input, how do I safely create an XPathNavigator against an XML data source?

The XML data source:

  • May contain invalid hexadecimal characters that need to be removed.
  • May contain characters that do not match the declared encoding of the document.

As an example, some XML data sources in the cloud will have a declared encoding of utf-8, but the actual encoding is windows-1252 or ISO 8859-1, which can cause an invalid character exception to be thrown when creating an XmlReader against the Stream.

From the StreamReader.CurrentEncoding property documentation: "The current character encoding used by the current reader. The value can be different after the first call to any Read method of StreamReader, since encoding autodetection is not done until the first call to a Read method." This seems to indicate that CurrentEncoding can be checked after the first read, but are we then stuck storing this encoding for when we need to write the XML data out to a Stream?

I am hoping to find a best practice for safely creating an XPathNavigator/IXPathNavigable instance against an XML data source that will gracefully handle encoding and invalid-character issues (preferably in C#).

A: 

When using an XmlTextReader or something similar, the reader itself will figure out the encoding declared in the XML file.
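As a rough illustration of that behaviour, the sketch below hands raw bytes to XmlReader.Create and lets the reader pick the encoding from the XML declaration. The class and method names (DeclaredEncodingDemo, ReadRootText) are made up for this example. Note that the reader trusts the declaration, so a document declared as UTF-8 but actually stored as ISO 8859-1 is expected to fail with an XmlException rather than be repaired.

```csharp
using System.IO;
using System.Xml;

// Hypothetical helper, for illustration only.
public static class DeclaredEncodingDemo
{
    // Reads the text content of the root element. The XmlReader decodes the
    // bytes according to the byte order mark or the encoding named in the
    // XML declaration; no encoding is supplied by the caller.
    public static string ReadRootText(byte[] xmlBytes)
    {
        using (var reader = XmlReader.Create(new MemoryStream(xmlBytes)))
        {
            reader.MoveToContent();
            return reader.ReadElementContentAsString();
        }
    }
}
```

This only helps when the declaration is correct; it does nothing for the mismatched case described in the question.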

Markus Nigbur
StreamReader.CurrentEncoding: "The current character encoding used by the current reader. The value can be different after the first call to any Read method of StreamReader, since encoding autodetection is not done until the first call to a Read method." So is checking CurrentEncoding after the first read the recommended approach?
Oppositional
+1  A: 

It's possible to use the DecoderFallback class (and a few related classes) to deal with bad characters, either by skipping them or by doing something else (restarting with a new encoding?).
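A minimal sketch of both options follows, assuming the data is already in a byte array. FallbackDemo and its method names are invented for this example; the Encoding.GetEncoding overload and the fallback classes are part of System.Text. The strict variant throws on the first invalid byte sequence so the caller can try something else, while the lenient variant replaces invalid sequences with an empty string, which effectively skips them.

```csharp
using System.Text;

// Hypothetical helper, for illustration only.
public static class FallbackDemo
{
    // Strict decoding: returns false when the bytes are not valid in the
    // named encoding instead of silently substituting replacement characters.
    public static bool TryDecodeStrict(byte[] bytes, string encodingName, out string text)
    {
        Encoding strict = Encoding.GetEncoding(
            encodingName,
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        try
        {
            text = strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            text = string.Empty;
            return false;
        }
    }

    // Lenient decoding: invalid byte sequences are replaced with an empty
    // string, so they are dropped from the result.
    public static string DecodeSkippingInvalid(byte[] bytes, string encodingName)
    {
        Encoding lenient = Encoding.GetEncoding(
            encodingName,
            EncoderFallback.ReplacementFallback,
            new DecoderReplacementFallback(string.Empty));
        return lenient.GetString(bytes);
    }
}
```

Skipping bytes loses data silently, so the strict variant is probably the better starting point when a different encoding might decode the input correctly.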

Doug McClean
I am not sure if this would work, but it seems like a good way. The only thing I would be able to come up with is rolling a custom XML parser. Good answer.
Jonathan C Dickinson
+2  A: 

I had a similar issue when some XML fragments were imported into a CRM system using the wrong encoding (there was no encoding stored along with the XML fragments).

In a loop, I created a wrapper stream using the current encoding from a list. Each encoding was constructed with the DecoderExceptionFallback and EncoderExceptionFallback options (as mentioned by @Doug). If a DecoderFallbackException was thrown during processing, the original stream was reset and the next-most-likely encoding was tried.

Our encoding list was something like UTF-8, Windows-1252, GB2312 and US-ASCII. If you fell off the end of the list, the stream was really bad and was rejected, ignored, etc.
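The loop described above might look roughly like the sketch below. This is not the original sample; EncodingFallbackLoader and CreateNavigator are hypothetical names, and the input is buffered into memory so that it can be re-read even when the source stream is not seekable. On .NET Core and later, code-page encodings such as Windows-1252 and GB2312 are only available after calling Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).

```csharp
using System.IO;
using System.Text;
using System.Xml.XPath;

// Hypothetical helper, for illustration only.
public static class EncodingFallbackLoader
{
    // Tries each encoding in order of preference and returns a navigator
    // for the first one that decodes the whole input without errors.
    public static XPathNavigator CreateNavigator(Stream source, params string[] encodingNames)
    {
        // Buffer the input once; each attempt below re-reads the same bytes,
        // which plays the role of resetting the original stream.
        var buffer = new MemoryStream();
        source.CopyTo(buffer);
        byte[] data = buffer.ToArray();

        foreach (string name in encodingNames)
        {
            Encoding strict = Encoding.GetEncoding(
                name,
                EncoderFallback.ExceptionFallback,
                DecoderFallback.ExceptionFallback);
            string xml;
            try
            {
                // The StreamReader also consumes a byte order mark if present.
                using (var reader = new StreamReader(new MemoryStream(data), strict, true))
                {
                    xml = reader.ReadToEnd();
                }
            }
            catch (DecoderFallbackException)
            {
                continue; // Not valid in this encoding; try the next one.
            }

            // Parsing from text means the decoding has already been done here,
            // not by the XML reader. Malformed XML still throws XmlException.
            return new XPathDocument(new StringReader(xml)).CreateNavigator();
        }

        throw new InvalidDataException("The input could not be decoded with any of the supplied encodings.");
    }
}
```

A call such as EncodingFallbackLoader.CreateNavigator(stream, "utf-8", "iso-8859-1") would try strict UTF-8 first. ISO 8859-1 maps every byte value to a character, so it never fails and only makes sense as the last entry in the list.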

EDIT:

I whipped up a quick sample and basic test files (source here). The code doesn't have any heuristics to choose between code pages that both match the same set of bytes, so a Windows-1252 file may be detected as GB2312, and vice versa, depending on the file content and the encoding preference order.

devstuff
This sounds like a good solution to the problem. Could you provide some example code?
Oppositional
Added sample link
devstuff
Thanks! You got the bounty.
Oppositional
Very nice answer!
Jarrod Dixon