What's the accepted way of storing quoted data in XML?

For example, for a node, which is correct?

  • (a) <name>Jesse "The Body" Ventura</name>
  • (b) <name>Jesse \"The Body\" Ventura</name>
  • (c) <name>Jesse &quot;The Body&quot; Ventura</name>
  • (d) none of the above (please specify)

If (a), what do you do for attributes? If (c), is it really appropriate to mix HTML & XML? Similarly, how do you handle single and curly quotes?


The correct answer is 'C'.

Single quotes don't really cause a problem, but you need to be careful of ampersands and left angle brackets.

Joel Coehoorn
+11  A: 

Both A and C are correct, as " is not a character that must be encoded in element data.

You should always XML-encode characters such as >, <, and & unless they are inside a CDATA section. These are the key characters to be concerned about in element data.

In attribute values you also have to be careful of ' and ", depending on which of the two you use to delimit the value.

I've found that encoding " and ' everywhere is often the better idea, as it helps when converting to other formats where " or ' might cause problems as well.

Mitchel Sellers
This isn't correct: see section 2.4 in the XML spec (http://www.xml.com/axml/testaxml.htm). Quotes are not included in the characters requiring escape. They do require escape in attributes, but not in normal text.
James Sulak
Yes, that is correct.....
Mitchel Sellers
You should edit your answer to reflect the correct information.
Michael Burr
+2  A: 

You shouldn't worry about how things are encoded in your XML. You should always use a proper library for generating XML documents. There are too many gotchas in XML to get it right by yourself. I've seen tons of invalid XML documents come my way because somebody thought they could generate proper XML themselves, without using a library. All major programming languages in use today have XML libraries.
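
As a rough illustration of letting a library handle the escaping, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The element and attribute names (person, name, alias) are made up for the example and do not come from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical element and attribute names, chosen only for illustration.
person = ET.Element("person", {"alias": 'Jesse "The Body" Ventura'})
name = ET.SubElement(person, "name")
name.text = 'Jesse "The Body" Ventura & <friends>'

# The serializer decides what has to be escaped in text and in attributes.
xml_text = ET.tostring(person, encoding="unicode")
print(xml_text)

# Parsing the serialized form gives back the original strings unchanged.
parsed = ET.fromstring(xml_text)
```

The point of the sketch is that the calling code never writes an entity by hand; it only assigns plain strings and reads plain strings back.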


It depends really. If all you want to do is have quotes in your XML string, then 'A'.

But if there is meaning or you need to abstract the quote (i18n for example), XML affords richer options. For example:

  <nickName>the Body</nickName>

Overkill in many situations. But if you need to correctly handle many of the world's varied - and frequently inconsistent - naming schemes, I'd think about encoding your names along these lines. XML is great for this.

+8  A: 
Michael Burr
+3  A: 

Double quotes in text nodes can be represented either as the double-quote character or as the &quot; entity. Double quotes in attribute values can be represented as the double-quote character if the value is delimited by single quotes, and vice versa; otherwise, escape them as &quot;.
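
A quick way to convince yourself that the two spellings are equivalent is to parse both and compare. This is a small sketch with Python's standard xml.etree.ElementTree; the alias attribute is a made-up name.

```python
import xml.etree.ElementTree as ET

literal = ET.fromstring('<name>Jesse "The Body" Ventura</name>')
escaped = ET.fromstring('<name>Jesse &quot;The Body&quot; Ventura</name>')

# Both spellings produce the same text node.
same_text = literal.text == escaped.text

# In attributes, either switch delimiters or escape the delimiter in use.
# "alias" is a hypothetical attribute name.
single_delimited = ET.fromstring("<person alias='Jesse \"The Body\" Ventura'/>")
double_delimited = ET.fromstring('<person alias="Jesse &quot;The Body&quot; Ventura"/>')
same_attr = single_delimited.get("alias") == double_delimited.get("alias")

print(same_text, same_attr)
```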

This is only relevant if you're a) editing XML in a non-XML-aware text editor or b) creating XML programmatically through string manipulation. Generally speaking, you should avoid (a) unless you really know what you're doing, or at least have a way of checking the well-formedness of your XML after editing is complete.

And you should avoid (b) under all circumstances. Never create XML through string manipulation; always use a DOM or some other tool.
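
If you do end up editing XML by hand, the well-formedness check mentioned above can be as simple as running the file through a parser. A minimal sketch in Python follows; is_well_formed is a hypothetical helper name, and it checks well-formedness only, not validity against a DTD or schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts xml_text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed('<name>Jesse "The Body" Ventura</name>'))  # plain quotes in text
print(is_well_formed('<name>AT&T</name>'))                      # bare ampersand
print(is_well_formed('<name alias="say "hi""/>'))               # unescaped delimiter
```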

Robert Rossney
+2  A: 

For example, for a node, which is correct?

The XML specification itself doesn't talk about nodes (other than when comparing DTD syntax to finite automaton regex). A DOM node can be attribute, element, text or any of the other node types.

Inside a text node, you only need to escape characters which the parser would interpret as starting a different node - so you escape & and < as &amp; and &lt; .
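
For illustration, Python's standard xml.sax.saxutils.escape applies this rule to text content. Note that it also escapes >, which is harmless, and that it leaves quotes alone. The sample string is made up.

```python
from xml.sax.saxutils import escape

raw = 'if a < b && c == "x"'
escaped = escape(raw)

# Ampersand and less-than are replaced; the double quotes are untouched.
print(escaped)
```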

For portability, it's often a good idea to escape curly quotes, but there is no reason to escape plain quotes in XML text.

Inside an attribute node, you have to escape less-than and ampersand as before, and also whichever quote you used to delimit the attribute.

<foo attribute="'ok'" attribute2='"also-ok"' attribute3="&quot;needed&quot;"/>
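
The same three cases can be reproduced with Python's standard xml.sax.saxutils.quoteattr, which picks a delimiter for you and only escapes a quote when the value contains both kinds. This is just a sketch of the rule; the sample values are made up.

```python
from xml.sax.saxutils import quoteattr

values = ["'ok'", '"also-ok"', "\"needed\" and 'both'"]

# quoteattr returns the value together with its surrounding delimiters.
quoted = [quoteattr(v) for v in values]

for q in quoted:
    print("attribute=" + q)
```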

It's usually easier to get in the habit of only using one type and always escaping it. I write quite a bit of XSLT and favour using " outside and ' inside:

<xsl:value-of select="person[@name = 'bob']"/>

If you get paranoid with the escaping, the XPath becomes less readable:

<xsl:value-of select="person[@name = &apos;bob&apos;]"/>

If (c), is it really appropriate to mix HTML & XML?

XML defines the named entities amp, gt, lt, apos, and quot.

HTML defines many more entities.

You can and should use the XML named entities in XML in preference to a numeric character reference.

The lt entity escapes < and should be used in text and attribute values. The amp entity escapes & and should be used in text and attribute values. The apos and quot entities escape ' and " and should be used in attribute values. The gt entity is a bit useless - there is almost never a syntactic requirement to escape > in XML. Maybe > only agreed to work with < if it got equal billing.
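
To make the XML versus HTML distinction concrete, here is a small Python sketch using the standard xml.etree.ElementTree parser: the five predefined entities parse without any DTD, while an HTML-only entity such as &nbsp; is rejected as undefined. The element name t is made up.

```python
import xml.etree.ElementTree as ET

# The five predefined XML entities need no declaration.
doc = ET.fromstring("<t>&amp; &lt; &gt; &apos; &quot;</t>")
print(doc.text)

# An HTML-only entity is not declared in plain XML, so parsing fails.
try:
    ET.fromstring("<t>&nbsp;</t>")
    nbsp_accepted = True
except ET.ParseError:
    nbsp_accepted = False

print(nbsp_accepted)
```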

The other one I use a lot in XSLT that generates source code is &#xa; which inserts a new line. &nl; would have been more useful than &gt;

Similarly, how do you handle single and curly quotes?

XML is designed to mark up Unicode text, and the curly quotes have no special meaning in it. However, it's not uncommon for the encoding used for an XML document to be misinterpreted in the wild. So if you're in a closed environment and can guarantee correct Unicode encoding at producer and consumer, then I'd just put the character in the XML. Otherwise use a numeric character reference. That's true of any character with a code point above 127 - there's nothing special about curly quotes.
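
As a sketch of the numeric-reference fallback, Python's standard xml.etree.ElementTree writes any character that does not fit the target encoding as a numeric character reference, so serializing to an ASCII-only encoding gives output that survives a misread encoding. The name element is taken from the question; the rest is illustrative.

```python
import xml.etree.ElementTree as ET

name = ET.Element("name")
name.text = "Jesse \u201cThe Body\u201d Ventura"  # curly double quotes

# An ASCII-only target encoding forces non-ASCII characters to be
# written as numeric character references.
ascii_bytes = ET.tostring(name, encoding="us-ascii")
print(ascii_bytes)

# A parser turns the references back into the original characters.
round_tripped = ET.fromstring(ascii_bytes).text
```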

Pete Kirkham