Hello everyone,

I have an XML structure like the one below. Some Student items contain invalid UTF-8 byte sequences, which can cause parsing to fail for the whole XML document.

What I want to do is filter out the Student items that contain invalid UTF-8 byte sequences and keep the ones whose byte sequences are valid. Any advice or samples on how to do this in .NET (C# preferred)?

BTW: by invalid byte sequences I mean http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences

<?xml version="1.0" encoding="utf-8"?>
<AllStudents>
  <Student>
    Mike
  </Student>
  <Student>
    (Invalid name here)
  </Student>  
</AllStudents>

Thanks in advance, George

+1  A: 

This is very close to the "XML encoding issue" question.

bortzmeyer
That question deals with how to check whether the whole XML document is valid, but this question is about how to filter out the invalid items. Any ideas for this question?
George2
+2  A: 

That's pretty hard to do. You won't get an XML parser to parse a document with invalid characters in it, so I think you're reduced to a couple of options:

  1. Figure out why the encoding is wrong - a common problem is labeling the document as UTF-8 (or having no encoding declaration) when the document is actually written in Latin-1.
  2. Take out the bad sections by hand.
  3. Try and find a tag soup parser for .NET that will continue parsing after the error.
  4. Reject the invalid XML document.
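
As a rough illustration of option 1, here is a minimal C# sketch that decodes the file as strict UTF-8 and, if that fails, retries on the assumption that it is really Latin-1 (ISO-8859-1). The class and method names (`EncodingCheck`, `LoadUtf8OrLatin1`) are made up for this example, and the Latin-1 guess is only an assumption about how the file was produced. For option 4, you would let the exception propagate instead of retrying.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Linq;

static class EncodingCheck
{
    // Strict UTF-8: throws on invalid byte sequences instead of silently replacing them.
    static readonly Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Hypothetical helper: tries UTF-8 first, then falls back to Latin-1.
    public static XDocument LoadUtf8OrLatin1(string path)
    {
        string text;
        try
        {
            // File.ReadAllText also strips a leading byte order mark, if present.
            text = File.ReadAllText(path, StrictUtf8);
        }
        catch (DecoderFallbackException)
        {
            // Assumption: the file is mislabeled and was actually written in Latin-1.
            // Every byte is a valid Latin-1 character, so this decode cannot fail.
            text = File.ReadAllText(path, Encoding.GetEncoding("iso-8859-1"));
        }

        // Parsing from a string ignores the encoding named in the XML declaration.
        return XDocument.Parse(text);
    }
}
```

Note that the Latin-1 retry always "succeeds", so it only gives sensible names if the guess about the real encoding is right.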
John Snelson
Is there any way to use a regular expression for this kind of check?
George2
BTW: isn't tag soup for Java rather than .NET?
George2
+1  A: 

I don't know C#, so I'm afraid I can't give you code to do this, but the basic idea is to read the whole file as a UTF-8 text file, using a DecoderFallback to replace invalid sequences with either question marks or the Unicode replacement character U+FFFD. Then write the file back out as a UTF-8 text file, and parse that.

Basically, you separate out the operation of "wiping out bad utf-8 sequences" from the operation of "parsing the xml file".

You should probably even be able to skip writing the file back out again before running the XML parser to read in the fixed data; there should be some way to write the file to an in-memory byte stream and parse that byte stream as XML. (Again, sorry for not knowing C#)
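
For what it's worth, the idea might look roughly like this in C#. This is a sketch under assumptions, not tested against the real data: `StudentFilter` and `LoadAndFilter` are made-up names, and it assumes the invalid bytes sit inside element text rather than inside the markup itself. It decodes with a replacement fallback, parses the resulting string directly (so nothing is written back to disk), and then drops every Student whose text contains the replacement character.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml.Linq;

static class StudentFilter
{
    // U+FFFD marks every place where an invalid byte sequence was found.
    const string Marker = "\uFFFD";

    // Hypothetical helper: returns the document without the damaged Student items.
    public static XDocument LoadAndFilter(string path)
    {
        // Lenient UTF-8 decoder: invalid sequences become U+FFFD instead of throwing.
        Encoding lenientUtf8 = Encoding.GetEncoding(
            "utf-8",
            EncoderFallback.ExceptionFallback,
            new DecoderReplacementFallback(Marker));

        // Decode to a string in memory; no intermediate file is needed.
        string text = File.ReadAllText(path, lenientUtf8);
        XDocument doc = XDocument.Parse(text);

        // Remove each Student element whose text contains the marker.
        doc.Descendants("Student")
           .Where(s => s.Value.Contains(Marker))
           .Remove();

        return doc;
    }
}
```

One caveat with this approach: a Student whose name legitimately contains U+FFFD would be removed as well, and other problems (for example control characters that XML forbids) would still make the parse fail.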

Daniel Martin