I have a webpage that accepts HTML input from users. The input is converted into an XML document using the System.Xml namespace, like this:

var doc = new XmlDocument();
doc.AppendChild(doc.CreateElement("root"));
doc.DocumentElement.SetAttribute("BodyHTML", theTextBox.Text);

Afterwards, an XSL transformation (System.Xml.Xsl.XslCompiledTransform) is applied to the data.

Users tend to write text in Microsoft Word, using bullets, quotes, etc. When they paste it into my page, the text includes invalid characters such as 0x0C and 0x03. The XSL transformation then fails with the error "hexadecimal value 0x0C, is an invalid character."

My fix so far has been to strip the characters I've found to be offending, using loops and String.Replace: all characters from 0 to 31, except 9, 10 and 13, are replaced with String.Empty.
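
For reference, the filter described above could be written as a single pass over the string rather than repeated String.Replace calls. This is only a sketch of the rule as stated (drop 0 to 31 except 9, 10 and 13); the class and method names are made up for illustration. It does not cover the other characters XML 1.0 forbids, such as 0xFFFE, 0xFFFF and unpaired surrogates.

```csharp
using System.Text;

public static class ControlCharStripper
{
    // Hypothetical helper: removes code units 0..31 except
    // tab (9), line feed (10) and carriage return (13).
    public static string StripControlChars(string input)
    {
        if (string.IsNullOrEmpty(input))
        {
            return input;
        }

        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            if (c >= 32 || c == 9 || c == 10 || c == 13)
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

The cleaned string would then be passed to SetAttribute in place of theTextBox.Text.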

What I'm looking for is a better way to do this. Is there a built-in .NET method? Or perhaps just a complete list of illegal Unicode characters?

+1  A: 

Found two answers which do the same thing:

  1. http://seattlesoftware.wordpress.com/2008/09/11/hexadecimal-value-0-is-an-invalid-character/
  2. http://www.theplancollection.com/house-plan-related-articles/hexadecimal-value-invalid-character

The first uses a StringBuilder, loops through the characters one by one and filters out the illegal ones. The second uses a Regex and .Replace to accomplish the same thing. Both authors looked at the XML standard to find out which characters are illegal.
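
A rough sketch of both approaches follows. It is not the code from either link: the class and method names are invented here, and the ranges are taken from the Char production in the XML 1.0 specification (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF). Because .NET strings are UTF-16, the last range shows up as surrogate pairs, so both variants keep valid pairs and drop unpaired surrogates.

```csharp
using System.Text;
using System.Text.RegularExpressions;

public static class XmlSanitizer
{
    // Legal XML 1.0 characters that fit in one UTF-16 code unit.
    private static bool IsLegalBmpChar(char c)
    {
        return c == 0x9 || c == 0xA || c == 0xD
            || (c >= 0x20 && c <= 0xD7FF)
            || (c >= 0xE000 && c <= 0xFFFD);
    }

    // StringBuilder variant: copy legal characters one by one.
    public static string SanitizeWithStringBuilder(string input)
    {
        if (string.IsNullOrEmpty(input))
        {
            return input;
        }

        var sb = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (IsLegalBmpChar(c))
            {
                sb.Append(c);
            }
            else if (char.IsSurrogatePair(input, i))
            {
                // A valid pair encodes a code point in #x10000-#x10FFFF.
                sb.Append(c).Append(input[i + 1]);
                i++; // skip the low surrogate already copied
            }
            // Anything else (control chars, 0xFFFE, 0xFFFF,
            // unpaired surrogates) is dropped.
        }
        return sb.ToString();
    }

    // Regex variant: matches illegal control characters, 0xFFFE/0xFFFF,
    // and surrogates that are not part of a valid pair. Built once.
    private static readonly Regex InvalidXmlChars = new Regex(
        @"[\u0000-\u0008\u000B\u000C\u000E-\u001F\uFFFE\uFFFF]" +
        @"|[\uD800-\uDBFF](?![\uDC00-\uDFFF])" +
        @"|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]",
        RegexOptions.Compiled);

    public static string SanitizeWithRegex(string input)
    {
        if (string.IsNullOrEmpty(input))
        {
            return input;
        }
        return InvalidXmlChars.Replace(input, string.Empty);
    }
}
```

Either method could be applied to theTextBox.Text before the value is stored in the BodyHTML attribute.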

I did some timings on a long string (a 1.8 MB file, run 1,000 times) and a short string ("Hello world", run 10,000,000 times). The StringBuilder method (SanitizeXmlString) was about 3 times faster than the regex (CleanInvalidXmlChars). In my test the regex was compiled only once, unlike in the code I linked to.

Long string:

CleanInvalidXmlChars time: 00:00:07.4356230
SanitizeXmlString    time: 00:00:02.3703305

Short string:

CleanInvalidXmlChars time: 00:00:05.2805834
SanitizeXmlString    time: 00:00:01.8319114
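
A timing loop along these lines would reproduce this kind of measurement. It is a minimal sketch, not the harness used for the numbers above: the SanitizerTimer name and the single warm-up call are assumptions, and the method under test is passed in as a delegate.

```csharp
using System;
using System.Diagnostics;

public static class SanitizerTimer
{
    // Hypothetical helper: runs sanitize(input) the given number of times
    // and returns the elapsed wall-clock time.
    public static TimeSpan Time(Func<string, string> sanitize, string input, int iterations)
    {
        // Assumed warm-up call so JIT compilation (and regex construction,
        // if any) is not counted in the measurement.
        sanitize(input);

        Stopwatch stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            sanitize(input);
        }
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}
```

For the short-string case this would be called as SanitizerTimer.Time(SanitizeXmlString, "Hello world", 10000000), and likewise for CleanInvalidXmlChars.
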
Martin Ørding-Thomsen