I'm using XML serialization heavily in a web service (the contracts pass complex types as parameters). Recently I noticed that the .NET XML serialization engine escapes some of the five well-known reserved characters that must be escaped when they appear within an element (<, >, &, ' and "). My first reaction was "good old .NET, always looking out for me".

But then I started experimenting and noticed that it only escapes <, > and &, and for some reason not the apostrophe or the double quote. For example, if I return this literal string in a field of a complex type from my service:

 Bad:<>&'":Data

This is what is transferred over the wire (as seen from Fiddler):

 Bad:&lt;&gt;&amp;'":Data

Has anyone run into this, or does anyone understand why it happens? Is the serializer simply overlooking them, or is there a reason for this? As I understand it, ' and " are not valid within an XML element according to the spec.

A: 

XMLSpy says you're wrong. The following is well-formed XML:

<root>
 <data>'"</data>
</root>
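
The same check can be made without XMLSpy. As a minimal sketch (Python is used here only for illustration, since the question is about .NET; the variable names are made up), the standard-library parser accepts the document and returns the quotes unchanged:

```python
import xml.etree.ElementTree as ET

# The document from above, with a raw apostrophe and double quote in element content
doc = """<root>
 <data>'"</data>
</root>"""

# fromstring() raises xml.etree.ElementTree.ParseError if the input is not well-formed
root = ET.fromstring(doc)
data_text = root.find("data").text
print(repr(data_text))
```

If the unescaped quotes were illegal in element content, the parse call would fail instead of returning the text.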


Aside from "argument by reference to XMLSpy", a better argument is that the XML Serializer has been out in the wild for over seven years. In that time, I guarantee someone has tried to serialize "O'Brien" in a Name property. If this were a bug, it would have been noticed by now.

John Saunders
A: 

According to the XML spec, for regular content and markup:

  • & always needs to be escaped as &amp; because it's the escape character
  • < always needs to be escaped as &lt; because it marks the start of an element. It has to be escaped even within attribute values, as a safety measure and to make error detection in parsers simpler.
  • > does not need to be escaped as &gt; except where it would complete the sequence ]]> in content; it is often escaped anyway for symmetry with <
  • ' needs to be escaped as &apos; only if in an attribute delimited by '
  • " needs to be escaped as &quot; only if in an attribute delimited by "
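
To illustrate the content versus attribute distinction in the list above, here is a minimal sketch using Python's standard library rather than the .NET serializer from the question (so it shows the rules, not what XmlSerializer does internally). `escape()` is meant for element content and `quoteattr()` for attribute values; the element name `root` and attribute name `a` are made up for the example:

```python
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

raw = "Bad:<>&'\":Data"

# Element content: by default escape() rewrites only &, < and >; quotes are left alone
content = escape(raw)
print(content)

# Attribute value: quoteattr() chooses a delimiter, adds it, and escapes
# any quote character that would conflict with that delimiter
attr = quoteattr(raw)
print(attr)

# Build a small document and parse it back to confirm nothing was lost
doc = "<root a=%s>%s</root>" % (attr, content)
root = ET.fromstring(doc)
print(root.text == raw, root.get("a") == raw)
```

The escaped element content comes out the same as the string seen in Fiddler in the question, and both values survive the round trip.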

Inside processing instructions, comments and CDATA sections the rules differ somewhat; the details are in section 2.4, "Character Data and Markup", of the spec.

Your serializer is trying to do you a favor by keeping the file somewhat human-readable.

(Each of the above may also be escaped using its numeric character reference, decimal or hexadecimal.)
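
As a small sketch of that last point (again in Python purely for illustration, with made-up variable names), the named entities and the decimal and hexadecimal character references all decode to the same five characters:

```python
import xml.etree.ElementTree as ET

# The five characters written three ways: named entities,
# decimal character references, hexadecimal character references
named = ET.fromstring("<data>&lt;&gt;&amp;&apos;&quot;</data>").text
numeric = ET.fromstring("<data>&#60;&#62;&#38;&#39;&#34;</data>").text
hexadecimal = ET.fromstring("<data>&#x3C;&#x3E;&#x26;&#x27;&#x22;</data>").text

print(named == numeric == hexadecimal)
```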

lavinio
Awesome, you're spot on. Thanks for correcting my thinking.
BrettRobi