I have a webpage that accepts HTML input from users. The input is converted into an XML document using the System.Xml
namespace, like this:
var doc = new XmlDocument();
doc.AppendChild(doc.CreateElement("root"));
doc.DocumentElement.SetAttribute("BodyHTML", theTextBox.Text);
Afterwards, an XSL transformation (System.Xml.Xsl.XslCompiledTransform) is applied to the data.
Users tend to write their text in Microsoft Word, using bullets, quotes, etc. When they paste it into my page, the text includes invalid characters such as 0x0C, 0x03 and so on. When the XSL transformation runs, it fails with this error: "hexadecimal value 0x0C, is an invalid character."
My fix so far has been to eliminate the characters that I've found to be offensive, using loops and String.Replace: all characters from 0 to 31, except 9, 10 and 13, are replaced with String.Empty.
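For reference, the workaround looks roughly like the sketch below. The class and method names (InputCleaner, StripControlChars) are made up for illustration and are not part of any library; the logic is just the loop over code points 0 to 31 described above.

```csharp
using System;

public static class InputCleaner
{
    // Hypothetical helper: removes the control characters 0-31,
    // keeping tab (9), line feed (10) and carriage return (13).
    public static string StripControlChars(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return text;
        }

        for (int code = 0; code < 32; code++)
        {
            if (code == 9 || code == 10 || code == 13)
            {
                continue; // these three are allowed, so leave them alone
            }

            // String.Replace(string, string) does an ordinal search,
            // so each control character is matched exactly.
            text = text.Replace(((char)code).ToString(), String.Empty);
        }

        return text;
    }
}
```

It would be called on the textbox value before it is passed to SetAttribute, e.g. InputCleaner.StripControlChars(theTextBox.Text).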
What I'm looking for is a better way to do this. Is there a built-in .NET method? Or perhaps just a complete list of characters that are illegal in XML?