I was looking for a generic method in .Net to encode a string for use in an Xml element or attribute, and was surprised when I didn't immediately find one. So, before I go too much further, could I just be missing the built-in function?

Assuming for a moment that it really doesn't exist, I'm putting together my own generic EncodeForXml(string data) method, and I'm thinking about the best way to do this.

The data I'm using that prompted this whole thing could contain bad characters like &, <, ", etc. It could also occasionally contain properly escaped entities: &amp;, &lt;, and &quot;, which means just using a CDATA section may not be the best idea. That seems kind of clunky anyway; I'd much rather end up with a nice string value that can be used directly in the XML.

I've used a regular expression in the past to just catch bad ampersands, and I'm thinking of using it to catch them in this case as well as the first step, and then doing a simple replace for other characters.

So, could this be optimized further without making it too complex, and is there anything I'm missing? :

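Function EncodeForXml(ByVal data As String) As String
    ' Matches an ampersand that does not already start an entity or a numeric character reference.
    Static badAmpersand As New Regex("&(?![a-zA-Z]{2,6};|#[0-9]{2,4};)")

    data = badAmpersand.Replace(data, "&amp;")

    ' Escape the remaining markup characters.
    Return data.Replace("<", "&lt;").Replace("""", "&quot;").Replace(">", "&gt;")
End Function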

Sorry for all you C# -only folks-- I don't really care which language I use, but I wanted to make the Regex static and you can't do that in C# without declaring it outside the method, so this will be VB.Net

Finally, we're still on .Net 2.0 where I work, but if someone could take the final product and turn it into an extension method for the string class, that'd be pretty cool too.

Update: The first few responses indicate that .Net does indeed have built-in ways of doing this. But now that I've started, I kind of want to finish my EncodeForXml() method just for the fun of it, so I'm still looking for ideas for improvement. Notably: a more complete list of characters that should be encoded as entities (perhaps stored in a list/map), and something that performs better than chaining .Replace() calls on immutable strings.
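One way the single-pass idea could look, as a rough sketch rather than a finished implementation: walk the string once and append to a StringBuilder instead of chaining Replace() calls. The class name XmlEncoder and the helper StartsReference are made up for illustration, and the "already escaped" test is deliberately loose (any run of ASCII letters, or # plus digits, followed by a semicolon), which is an assumption modelled on the regex above rather than on the XML specification.

```csharp
using System;
using System.Text;

public static class XmlEncoder
{
    private static bool IsAsciiLetter(char c)
    {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    private static bool IsAsciiDigit(char c)
    {
        return c >= '0' && c <= '9';
    }

    // Hypothetical helper: true when the '&' at 'index' is followed by
    // letters or '#' plus digits and then ';' (for example &amp; or &#38;).
    private static bool StartsReference(string data, int index)
    {
        int i = index + 1;
        bool numeric = i < data.Length && data[i] == '#';
        if (numeric) i++;

        int start = i;
        while (i < data.Length &&
               (numeric ? IsAsciiDigit(data[i]) : IsAsciiLetter(data[i])))
        {
            i++;
        }

        // At least one character must have been consumed before the ';'.
        return i > start && i < data.Length && data[i] == ';';
    }

    public static string EncodeForXml(string data)
    {
        if (string.IsNullOrEmpty(data)) return data;

        StringBuilder sb = new StringBuilder(data.Length + 16);
        for (int i = 0; i < data.Length; i++)
        {
            char c = data[i];
            switch (c)
            {
                case '<': sb.Append("&lt;"); break;
                case '>': sb.Append("&gt;"); break;
                case '"': sb.Append("&quot;"); break;
                case '\'': sb.Append("&apos;"); break;
                case '&':
                    // Leave ampersands that already look escaped untouched.
                    if (StartsReference(data, i)) sb.Append('&');
                    else sb.Append("&amp;");
                    break;
                default: sb.Append(c); break;
            }
        }
        return sb.ToString();
    }
}
```

With this sketch, EncodeForXml("a & b &amp; <c>") gives "a &amp; b &amp; &lt;c&gt;". The trade-off is the same as with the regex: text such as &copy; passes through unchanged even though XML only predefines lt, gt, amp, quot and apos, so the result may still need a DTD or a second look.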

+9  A: 

In the past I have used HttpUtility.HtmlEncode to encode text for XML. It performs much the same task. I haven't run into any issues with it yet, but that's not to say I won't in the future. As the name implies, it was made for HTML, not XML.

You've probably already read it, but here is an article on xml encoding and decoding.

EDIT: Of course, if you use an XmlWriter or one of the new XElement classes, this encoding is done for you. In fact, you could just take the text, place it in a new XElement instance, then return the string (.ToString()) version of the element. I've heard that SecurityElement.Escape will perform the same task as your utility method as well, but I haven't read much about it or used it.

EDIT2: Disregard my comment about XElement, since you're still on 2.0

Kilhoffer
A: 

If this is an ASP.NET app why not use Server.HtmlEncode() ?

Kev
This is in a library that will be used for both asp.net apps and batch processing (desktop).
Joel Coehoorn
You can actually access Server.HtmlEncode() in a desktop app - all you have to do is add a reference to System.Web
amdfan
+3  A: 

System.Xml handles the encoding for you, so you don't need a method like this.

MusiGenesis
I'll have to check that- the problems I've had in the past are from _reading_ bad docs generated by others, and I haven't done much writing yet. This would certainly explain the lack of a built-in function.
Joel Coehoorn
Yeah, if the other docs didn't encode correctly, System.Xml won't read them correctly.
MusiGenesis
Joel Coehoorn
It would encode the ampersand. Whatever string you put in is exactly what you'll get back out.
MusiGenesis
So then I still need a way to handle incoming data that may be _partially_ encoded.
Joel Coehoorn
Or go shout at the guys who aren't encoding their xml correctly.
Sekhat
+5  A: 

SecurityElement.Escape

documented here
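A minimal usage sketch, for anyone who wants to see what the call does. The wrapper class EscapeDemo is only there for illustration; SecurityElement.Escape itself lives in System.Security and is available in .NET 2.0.

```csharp
using System;
using System.Security;

public static class EscapeDemo
{
    // Thin wrapper around the framework call; replaces <, >, &, " and '
    // with the five predefined XML entities.
    public static string Escape(string text)
    {
        return SecurityElement.Escape(text);
    }
}
```

EscapeDemo.Escape("a < b & \"c\"") returns "a &lt; b &amp; &quot;c&quot;". Note that every ampersand is escaped, so input that is already encoded gets encoded again: "&amp;" comes back as "&amp;amp;". That makes it unsuitable on its own for the partially encoded data described in the question.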

workmad3
This seems like what I'm looking for, but there are some comments at the bottom indicating the implementation is less than stellar.
Joel Coehoorn
+2  A: 

XmlTextWriter.WriteString() does the escaping.

GSerg
+3  A: 

MusiGenesis is exactly right. Here's a C# version:

// Requires: using System.IO; using System.Xml;
public static string EncodeForXml(string data) {
    // Let the DOM do the escaping: store the raw text in a node, then serialize its content.
    XmlDocument doc = new XmlDocument();
    XmlNode node = doc.AppendChild(doc.CreateElement("xml"));
    node.InnerText = data;
    StringWriter writer = new StringWriter();
    XmlTextWriter xml_writer = new XmlTextWriter(writer);
    node.WriteContentTo(xml_writer);
    return writer.ToString();
}
Ishmael
+1  A: 

This might be the case where you could benefit from using the WriteCData method.

public override void WriteCData(string text)
    Member of System.Xml.XmlTextWriter

Summary:
Writes out a <![CDATA[...]]> block containing the specified text.

Parameters:
text: Text to place inside the CDATA block.

A simple example would look like the following:

writer.WriteStartElement("name");
writer.WriteCData("<unsafe characters>");
writer.WriteFullEndElement();

The result looks like:

<name><![CDATA[<unsafe characters>]]></name>

When reading the node values, the XmlReader automatically strips out the CDATA wrapper from the inner text, so you don't have to worry about it. The only catch is that you have to store the data as the InnerText value of an XML node. In other words, you can't insert CDATA content into an attribute value.

Dscoduc
+5  A: 

Depending on how much you know about the input, you may have to take into account that not all Unicode characters are valid XML characters.

Both Server.HtmlEncode and System.Security.SecurityElement.Escape seem to ignore invalid XML characters, while System.Xml.XmlWriter.WriteString throws an ArgumentException when it encounters invalid characters (unless you disable that check, in which case it ignores them). Overview of library functions here.

I'm thinking a solution to this particular challenge would look like this:

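    // Requires: using System.Linq;
    public static string EscapeXml(string raw) {
        var stripped = new String(
            raw
            // Keep only characters inside the XML Char ranges.
            // A char is a single UTF-16 code unit, so the last range never matches
            // and surrogate pairs are dropped by this filter.
            .Where(c => (0x1 <= c && c <= 0xD7FF) ||
                        (0xE000 <= c && c <= 0xFFFD) ||
                        (0x10000 <= c && c <= 0x10FFFF))
            // Drop the restricted control characters.
            .Where(c => !(0x1 <= c && c <= 0x8) &&
                        ! new [] { 0xB, 0xC }.Contains(c) &&
                        !(0xE <= c && c <= 0x1F) &&
                        !(0x7F <= c && c <= 0x84) && 
                        !(0x86 <= c && c <= 0x9F))
            .ToArray());
        return System.Security.SecurityElement.Escape(stripped);
    }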

Based on the solution found here.

This is only a partial solution, since it does not handle the partially encoded data requirement of the original problem nor is it a .NET 2.0 implementation.
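For the .NET 2.0 case, the filtering step could be sketched with a plain loop instead of LINQ. This is an illustrative variant, not the code from the linked solution: the names XmlCharFilter, IsXml10Char and StripInvalidXmlChars are invented here, the ranges are those of the XML 1.0 Char production, and unlike the LINQ version it keeps well-formed surrogate pairs.

```csharp
using System;
using System.Text;

public static class XmlCharFilter
{
    // True for code points allowed by the XML 1.0 Char production.
    private static bool IsXml10Char(int codePoint)
    {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    public static string StripInvalidXmlChars(string raw)
    {
        if (string.IsNullOrEmpty(raw)) return raw;

        StringBuilder sb = new StringBuilder(raw.Length);
        for (int i = 0; i < raw.Length; i++)
        {
            char c = raw[i];
            if (char.IsHighSurrogate(c) && i + 1 < raw.Length
                && char.IsLowSurrogate(raw[i + 1]))
            {
                // A well-formed pair always encodes 0x10000-0x10FFFF: keep both halves.
                sb.Append(c).Append(raw[i + 1]);
                i++;
            }
            else if (IsXml10Char(c))
            {
                // Lone surrogates (0xD800-0xDFFF) fail this test and are dropped.
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

The result can then be passed to System.Security.SecurityElement.Escape as in the method above. It still does nothing about input that is already partially encoded.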

Michael Kropat
Good answer. I have seen a similar solution in this article: http://seattlesoftware.wordpress.com/2008/09/11/hexadecimal-value-0-is-an-invalid-character/
Pag Sun
That article explains the problem really well.
Michael Kropat
+1  A: 

Microsoft's AntiXss library has methods for this:

AntiXss.XmlEncode(string s)
AntiXss.XmlAttributeEncode(string s)

It has HTML methods as well:

AntiXss.HtmlEncode(string s)
AntiXss.HtmlAttributeEncode(string s)
Luke Quinane
A: 

It seems there is already a method via the SecurityElement class.

http://www.csharper.net/blog/escape_xml_string_characters_in_c_.aspx

Richard