views:

7587

answers:

6

Is there any easy/general way to clean an XML based data source prior to using it in an XmlReader so that I can gracefully consume XML data that is non-conformant to the hexadecimal character restrictions placed on XML?

Note:

  • The solution needs to handle XML data sources that use character encodings other than UTF-8, e.g. by specifying the character encoding at the XML document declaration. Not mangling the character encoding of the source while stripping invalid hexadecimal characters has been a major sticking point.
  • The removal of invalid hexadecimal characters should only remove hexadecimal-encoded values, as you can often find href values in data that happen to contain a string that would otherwise match a hexadecimal character (a hedged sketch of one way to scope this is given just after these notes).
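
To make the second note concrete, one possible way to avoid touching ordinary text that merely looks hex-like is to target only character references rather than raw substrings. The following is only a sketch; the class and method names (InvalidXmlCharRefs, Strip) are illustrative, and decimal references would need a similar pattern.

using System.Text.RegularExpressions;

static class InvalidXmlCharRefs
{
    // Hex character references to code points the XML 1.0 Char production forbids:
    // C0 controls other than TAB/LF/CR, the surrogate block D800-DFFF, and FFFE/FFFF.
    // Decimal references (e.g. &#8;) would need an analogous pattern.
    private static readonly Regex InvalidHexRef = new Regex(
        @"&#x0*(?:[0-8BCEF]|1[0-9A-F]|D[89A-F][0-9A-F]{2}|FFF[EF]);",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static string Strip(string xml)
    {
        return InvalidHexRef.Replace(xml, string.Empty);
    }
}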

Background:

I need to consume an XML-based data source that conforms to a specific format (think Atom or RSS feeds), but want to be able to consume data sources that have been published which contain invalid hexadecimal characters per the XML specification.

In .NET if you have a Stream that represents the XML data source, and then attempt to parse it using an XmlReader and/or XPathDocument, an exception is raised due to the inclusion of invalid hexadecimal characters in the XML data. My current attempt to resolve this issue is to parse the Stream as a string and use a regular expression to remove and/or replace the invalid hexadecimal characters, but I am looking for a more performant solution.
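
For reference, a rough sketch of the regex-based approach described above might look like the following (the helper name CreateCleanedReader is just illustrative). It leans on StreamReader's BOM detection for the encoding, which is part of why it does not fully solve the encoding concern in the notes above.

using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using System.Xml;

static class XmlCleaner
{
    public static XmlReader CreateCleanedReader(Stream source)
    {
        // Decode the stream first; BOM detection helps only when a BOM is present,
        // so an encoding declared solely in the XML declaration is still a weak
        // point of this approach.
        string raw;
        using (StreamReader reader = new StreamReader(source, Encoding.UTF8, true))
        {
            raw = reader.ReadToEnd();
        }

        // Strip literal characters the XML 1.0 Char production forbids:
        // C0 controls other than TAB/LF/CR, plus U+FFFE and U+FFFF.
        // (Lone surrogates are not handled by this simple pattern.)
        string cleaned = Regex.Replace(
            raw,
            @"[\x00-\x08\x0B\x0C\x0E-\x1F\uFFFE\uFFFF]",
            string.Empty);

        return XmlReader.Create(new StringReader(cleaned));
    }
}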

+9  A: 

It may not be perfect, but what I've done in that case is below. You can adjust it to work with a stream (a rough sketch of that follows the code).

/// <summary>
/// Removes control characters and other non-UTF-8 characters
/// </summary>
/// <param name="inString">The string to process</param>
/// <returns>A string with no control characters or entities above 0x00FD</returns>
public static string RemoveTroublesomeCharacters(string inString)
{
    if (inString == null) return null;

    StringBuilder newString = new StringBuilder();
    char ch;

    for (int i = 0; i < inString.Length; i++)
    {

        ch = inString[i];
        // remove any characters outside the valid UTF-8 range as well as all control characters
        // except tabs and new lines
        if ((ch < 0x00FD && ch > 0x001F) || ch == '\t' || ch == '\n' || ch == '\r')
        {
            newString.Append(ch);
        }
    }
    return newString.ToString();

}
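One way to "adjust it to work with a stream" (only a sketch; the wrapper name InvalidXmlCharFilterReader is made up, and only the single-character Read override is shown) is a filtering TextReader that applies the same whitelist on the fly, so the whole document never has to be buffered as one string:

using System.IO;
using System.Xml;

public sealed class InvalidXmlCharFilterReader : TextReader
{
    private readonly TextReader inner;

    public InvalidXmlCharFilterReader(TextReader inner)
    {
        this.inner = inner;
    }

    public override int Read()
    {
        int c;
        while ((c = inner.Read()) != -1)
        {
            char ch = (char)c;
            // Same whitelist as above: skip anything outside it.
            if ((ch < 0x00FD && ch > 0x001F) || ch == '\t' || ch == '\n' || ch == '\r')
                return c;
        }
        return -1;
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing) inner.Dispose();
        base.Dispose(disposing);
    }
}

// Usage: let StreamReader handle the character encoding, then filter, then parse.
// using (StreamReader sr = new StreamReader(stream, true))
// using (InvalidXmlCharFilterReader filtered = new InvalidXmlCharFilterReader(sr))
// using (XmlReader xml = XmlReader.Create(filtered))
// {
//     while (xml.Read()) { /* ... */ }
// }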
Eugene Katz
Wonderful solution, thank you for posting this.
Jim Beam
@Jim well, feel free to upvote it then :)
Eugene Katz
It didn't catch 0x3c; one of my applications thinks it's hex. I am using ibex.
Thunder
try dnewcome's solution below.
Eugene Katz
+7  A: 

I like Eugene's whitelist concept. I needed to do a similar thing as the original poster, but I needed to support all Unicode characters, not just up to 0x00FD. The XML spec is:

Char = #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

In .NET, characters in a string are 16-bit UTF-16 code units, so we can't 'allow' 0x10000-0x10FFFF explicitly. The XML spec explicitly disallows the surrogate code points starting at 0xD800 from appearing on their own. However, if we allowed those surrogate code points in our whitelist, UTF-8 encoding the string might still produce valid XML in the end, as long as proper UTF-8 output was produced from the surrogate pairs of UTF-16 characters in the .NET string. I haven't explored this, though, so I went with the safer bet and didn't allow the surrogates in my whitelist (a sketch that keeps well-formed surrogate pairs follows the code below).

The comments in Eugene's solution are misleading, though: the characters we are excluding are perfectly valid Unicode code points ... the problem is that they are not valid in XML. We are not removing 'non-UTF-8 characters'; we are removing UTF-8 characters that may not appear in well-formed XML documents.

public static string XmlCharacterWhitelist( string in_string ) {
    if( in_string == null ) return null;

    StringBuilder sbOutput = new StringBuilder();
    char ch;

    for( int i = 0; i < in_string.Length; i++ ) {
        ch = in_string[i];
        if( ( ch >= 0x0020 && ch <= 0xD7FF ) ||
            ( ch >= 0xE000 && ch <= 0xFFFD ) ||
            ch == 0x0009 ||
            ch == 0x000A ||
            ch == 0x000D ) {
            sbOutput.Append( ch );
        }
    }
    return sbOutput.ToString();
}
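
Regarding the surrogate question above, a hedged variation (not part of the original answer) that keeps well-formed surrogate pairs, so supplementary-plane characters in #x10000-#x10FFFF survive, might look like this:

using System.Text;

public static string XmlCharacterWhitelistWithSurrogates( string in_string ) {
    if( in_string == null ) return null;

    StringBuilder sbOutput = new StringBuilder( in_string.Length );

    for( int i = 0; i < in_string.Length; i++ ) {
        char ch = in_string[i];

        if( ( ch >= 0x0020 && ch <= 0xD7FF ) ||
            ( ch >= 0xE000 && ch <= 0xFFFD ) ||
            ch == 0x0009 || ch == 0x000A || ch == 0x000D ) {
            sbOutput.Append( ch );
        }
        else if( char.IsHighSurrogate( ch ) &&
                 i + 1 < in_string.Length &&
                 char.IsSurrogatePair( ch, in_string[i + 1] ) ) {
            // A well-formed surrogate pair encodes a code point in [#x10000-#x10FFFF],
            // which the Char production above allows; keep both halves.
            sbOutput.Append( ch );
            sbOutput.Append( in_string[i + 1] );
            i++;
        }
        // Lone surrogates, disallowed controls, U+FFFE and U+FFFF are dropped.
    }
    return sbOutput.ToString();
}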
dnewcome
A: 

I found some issues, so this is my solution (in Java):

private static String removeNonUtf8CompliantCharacters( final String inString ) {
    if ( null == inString ) return null;
    byte[] byteArr = inString.getBytes();
    for ( int i = 0; i < byteArr.length; i++ ) {
        byte ch = byteArr[i];
        // Replace control bytes (other than tab, LF and CR) with a space.
        // Note that Java bytes are signed, so any byte >= 0x80 is negative,
        // fails the (ch > 31 && ch < 253) test and is replaced as well.
        if ( !( (ch > 31 && ch < 253) || ch == '\t' || ch == '\n' || ch == '\r' ) ) {
            byteArr[i] = ' ';
        }
    }
    return new String( byteArr );
}
savio
A: 

How do I run the code? Is it PHP, or does it just go at the top of the XML? How can I use it? Thanks.

The context of this question was within the .NET framework.
Oppositional
A: 

Try this for PHP!

$goodUTF8 = iconv("utf-8", "utf-8//IGNORE", $badUTF8);

Kesavan