ansaurus

Question

Are XHTML entity encodings valid in XML documents as long as they're contained inside CDATA tags?

Answer 1

+1 A:

Rex M 2009-03-20 04:36:32

No, it is not ignored; it is simply passed literally to the application, as plain text.

bortzmeyer 2009-03-20 16:09:25

Answer 2

+4 A:

A CDATA section allows literal text that would otherwise be interpreted in a special way in an XML document, such as something that looks like an entity reference or like XML tags. Anything in a CDATA section can also appear in valid XML without a CDATA section; you just need to use entity references to escape the special characters, so that they are treated not as XML markup but as character data forming the content of an element.

So yes, the following is perfectly valid, as long as it is what you intend:

<?xml version="1.0" encoding="UTF-8" ?> 
<outer>
  <inner><![CDATA[&copy;]]></inner>
</outer>

Here, the value of the inner element is the literal text &copy; (an ampersand, the letters "copy" and a semicolon), which the XML parser will not interpret as an entity reference for the copyright symbol. You can also do the following:

<?xml version="1.0" encoding="UTF-8" ?> 
<outer>
  <inner><![CDATA[<normally> this looks <like/> &amp; xml </normally>]]></inner>
</outer>

where the value for the inner element is

<normally> this looks <like/> &amp; xml </normally>

To do this without a CDATA section:

<?xml version="1.0" encoding="UTF-8" ?> 
<outer>
  <inner>&lt;normally&gt; this looks &lt;like/&gt; &amp;amp; xml &lt;/normally&gt;</inner>
</outer>

which is much less human-readable, but equivalent as far as an XML parser is concerned. If you did this (assuming that the inner element is defined in a schema or DTD as containing a string and not XML), then a validating XML parser will complain:

<?xml version="1.0" encoding="UTF-8" ?> 
<outer>
  <inner><normally> this looks <like/> &amp; xml </normally></inner>
</outer>

so you use CDATA or entity escaping to protect the special characters from the XML parser, so that the client of the XML data can get the value of inner, which happens to contain XML markup characters.

Note: To be clear, the last example above is well-formed XML, but if the schema or DTD says that the element inner contains xsd:string or equivalent, then it is an invalid XML document.
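As a rough check of the CDATA-versus-escaping equivalence described above, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The choice of Python and the helper name inner_text are assumptions for illustration, not part of the original answer; the XML declaration is left out only to keep the strings short.

```python
import xml.etree.ElementTree as ET


def inner_text(doc):
    """Hypothetical helper: parse doc and return the text of <inner>."""
    return ET.fromstring(doc).find("inner").text


# The same content written once with a CDATA section and once with escaping.
cdata_doc = (
    "<outer><inner><![CDATA[<normally> this looks <like/> &amp; xml "
    "</normally>]]></inner></outer>"
)
escaped_doc = (
    "<outer><inner>&lt;normally&gt; this looks &lt;like/&gt; &amp;amp; xml "
    "&lt;/normally&gt;</inner></outer>"
)

cdata_value = inner_text(cdata_doc)
escaped_value = inner_text(escaped_doc)

# Both documents hand the application the same character data.
print(cdata_value == escaped_value)
print(cdata_value)

# Inside CDATA, something that looks like an entity reference stays literal.
copy_value = inner_text("<outer><inner><![CDATA[&copy;]]></inner></outer>")
print(copy_value)
```

Note that ElementTree is a non-validating parser, so it only checks well-formedness; it will not report the schema or DTD violation discussed in the note above.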

And no, HTML or XHTML entities that are not predefined by XML itself are not valid in XML unless they are declared, for example in a DTD. Otherwise your XML parser will return an error.

Eddie 2009-03-20 04:41:58

That last example is well-formed isn't it? You're just saying that any DTD or XSD that applied would have to allow nested tags...Just want to make sure I've understood. :-)

2009-03-20 04:51:35

Yes, the last example *is* well formed XML, but it may be invalid XML if the schema or DTD says the content of the "inner" tag is character data and not other elements.

Eddie 2009-03-20 05:02:57

I updated my answer in response to your comment.

Eddie 2009-03-20 05:05:07

Answer 3

+1 A:

Eddie gave a good reply; I will just add some points that he apparently did not mention.

<?xml version="1.0" encoding="UTF-8" ?> 
<outer>
  <inner>&copy;</inner>
</outer>

is not legal (the entity "copy" is not predefined; in XML, only "lt", "gt", "amp", "apos" and "quot" are).

<?xml version="1.0" encoding="UTF-8" ?> 
<outer>
  <inner>&#169;</inner>
</outer>

is perfectly legal and probably gives what you want (a copyright symbol).

<?xml version="1.0" encoding="UTF-8" ?> 
<outer>
  <inner><![CDATA[&copy;]]></inner>
</outer>

is also perfectly legal but yields a quite different result (the element <inner> will contain six Unicode characters, instead of one in the previous example).

<?xml version="1.0" encoding="UTF-8" ?> 
<!DOCTYPE outer[
<!ENTITY copy "&#169;">
]>
<outer>
  <inner>&copy;</inner>
</outer>

is legal, too, and gives the same result as the second example. Declaring an entity this way can save you from typing characters that you use but that are not easy to generate with your keyboard/editor.

<?xml version="1.0" encoding="UTF-8" ?> 
<outer>
  <inner>©</inner>
</outer>

is legal, too (because of encoding="UTF-8"; with encoding="US-ASCII", it would have been impossible), and gives the same result, provided that your keyboard/editor allows you to enter this character directly.
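The variants above can be compared with a short sketch, again assuming Python's standard xml.etree.ElementTree module (built on Expat) as the parser; the helper name inner_text and the variable names are hypothetical, and other parsers may word their errors differently. The XML declaration is omitted here for brevity.

```python
import xml.etree.ElementTree as ET


def inner_text(doc):
    """Hypothetical helper: parse doc and return the text of <inner>."""
    return ET.fromstring(doc).find("inner").text


# 1. Undeclared entity: the parser rejects the document.
try:
    inner_text("<outer><inner>&copy;</inner></outer>")
    undeclared_accepted = True
except ET.ParseError as err:
    undeclared_accepted = False
    print("rejected:", err)

# 2. Numeric character reference: one character, the copyright sign.
numeric = inner_text("<outer><inner>&#169;</inner></outer>")

# 3. CDATA section: the six literal characters & c o p y ;
cdata = inner_text("<outer><inner><![CDATA[&copy;]]></inner></outer>")

# 4. Entity declared in the internal DTD subset: expanded by the parser.
declared = inner_text(
    '<!DOCTYPE outer [<!ENTITY copy "&#169;">]>'
    "<outer><inner>&copy;</inner></outer>"
)

# 5. The character typed directly into the document.
literal = inner_text("<outer><inner>\u00a9</inner></outer>")

print(len(numeric), len(cdata), len(declared), len(literal))
```

Variants 2, 4 and 5 should all produce the same single character, while variant 3 keeps the text that merely looks like an entity reference.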

bortzmeyer 2009-03-20 16:16:44
