views: 681

answers: 8
I often see people asking XML/XSLT related questions here that are rooted in an inability to grasp how CDATA works (like this one).

I wonder - why does it exist in the first place? It's not that XML could not do without it: everything you can put into a CDATA section can also be expressed "natively" (XML-escaped).
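A minimal sketch of that equivalence, using Python's standard `xml.etree.ElementTree` parser. The element name `note` and the sample text are made up for illustration; the point is only that both spellings yield the same character data once parsed.

```python
import xml.etree.ElementTree as ET

# The same text, once inside a CDATA section and once XML-escaped.
with_cdata = "<note><![CDATA[if (a < b && b > c) { run(); }]]></note>"
escaped = "<note>if (a &lt; b &amp;&amp; b &gt; c) { run(); }</note>"

text_from_cdata = ET.fromstring(with_cdata).text
text_from_escaped = ET.fromstring(escaped).text

# After parsing, an application sees no difference between the two forms.
print(text_from_cdata == text_from_escaped)
```

The difference between the two forms exists only in the serialized document, which is why it matters to people reading the source and not to the consuming program.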

I appreciate that CDATA potentially makes the resulting document a bit smaller, but let's face it - XML is verbose anyway. Small XML documents can be achieved more easily through compression, for example.

For me, CDATA breaks the clean separation of markup and data since you can have data that looks like markup to the unaided eye, which I find is a bad thing. (This may even be one of the things that encourages people to inadequately apply string processing or regex to XML.)

So: What good reason is there to use CDATA?

+6  A: 

CDATA sections are just for the convenience of human authors, not for programs. Their only use is to give humans the ability to easily include e.g. SVG code in an XHTML page without needing to carefully replace every < with &lt; and so on.

That, for me, is the intended use, not making the resulting document a few bytes smaller because you can write < instead of &lt;.

Again taking the sample from above (SVG code in XHTML), it also makes it easy for me to check the source code of the XHTML file and just copy-paste the SVG code out without needing to back-replace &lt; with <.

jitter
I think it depends on whether you use text editors to manipulate XML or more appropriate tools like DOM APIs. I understand the convenience argument, though.
Tomalak
Also - inserting SVG code into an X(HT)ML document inside a CDATA section somehow defies the purpose, doesn't it? I mean - it would degrade perfectly well-formed XML to mere text data…
Tomalak
@Tomalak Exactly. That's what I mean by convenient for humans: when somebody edits some XML manually.
jitter
Re "SVG in CDATA defies the purpose": Umm, no, it doesn't. Why would it? What I meant was a page which talks about SVG and thus has to show pieces of sample SVG code, not display the SVG itself. Here CDATA sections make it easy to include the SVG code without reformatting it.
jitter
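The distinction discussed in the comments above (SVG shown as sample source text versus SVG embedded as live markup) can be sketched as follows. The tiny documents are illustrative only and leave out the XHTML and SVG namespaces for brevity.

```python
import xml.etree.ElementTree as ET

# SVG shown as sample source: inside CDATA it is plain character data.
as_text = ET.fromstring(
    '<div><pre><![CDATA[<svg><circle r="5"/></svg>]]></pre></div>'
)
pre = as_text.find("pre")
print(len(pre))   # number of child elements under <pre>
print(pre.text)   # the SVG source, available as a string

# SVG embedded as markup: it becomes real elements in the tree.
as_markup = ET.fromstring('<div><svg><circle r="5"/></svg></div>')
circle = as_markup.find("svg/circle")
print(circle.get("r"))
```

In the first document the parser never sees an `svg` element at all, which is exactly what a page that merely talks about SVG wants.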
+3  A: 

To me CDATA is just another word for lazy. When I started out with XML I used it, but nowadays I always convert data.

The best reason I can come up with is convenience, especially when you are using XML as some form of wrapper to transport data from one system to another. In that case you may end up with the following steps:

1. Create XML wrapper
2. Convert data to XML
3. Put data inside wrapper
4. Send XML to receiver
5. Split XML to XML + Data in XML
6. Convert Data in XML to Data

Whereas using CDATA would result in not requiring the extra conversion steps.
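As a rough sketch of the two routes in Python: the helper names `wrap_escaped` and `wrap_cdata` and the `envelope` element are hypothetical, not part of any library. Note one assumption made explicit here: the CDATA route is only conversion-free if the payload never contains `]]>`; otherwise that sequence has to be split across two sections.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def wrap_escaped(payload: str) -> str:
    """Convert the payload with XML escaping, then wrap it."""
    return f"<envelope>{escape(payload)}</envelope>"


def wrap_cdata(payload: str) -> str:
    """Wrap the payload in CDATA, leaving its markup characters untouched."""
    # "]]>" would end the section early, so split it across two sections.
    safe = payload.replace("]]>", "]]]]><![CDATA[>")
    return f"<envelope><![CDATA[{safe}]]></envelope>"


payload = "<order id='7'><note>a < b ]]> done</note></order>"

# The receiver's parser undoes either form and hands back the same string.
received_escaped = ET.fromstring(wrap_escaped(payload)).text
received_cdata = ET.fromstring(wrap_cdata(payload)).text
print(received_escaped == payload, received_cdata == payload)
```

With a real XML parser on the receiving side, steps 5 and 6 are done by the parser in both cases; the saving from CDATA is mostly on the sending side and in how readable the wire format is.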

Another usage could be to embed data without having to care about the different namespaces in the embedded data. But that is not really a great way to use it.

I've found another example of a good way to use CDATA, one that I should have thought of: when you need to embed code in an XML file and the code is not supposed to be converted, because otherwise it will not work and/or will not be easily readable.

Peter Lindqvist
+1  A: 

MXML demonstrates a great use of CDATA sections. One of the things I like about MXML is that it is valid XML, meaning I can do useful things like generate Flash widgets programmatically from a different XML file using a transform, and validate MXML against a schema.

CDATA sections are useful in MXML because they let me define an ActionScript script block within an MXML file, allowing me to combine an ECMA-type scripting language (with > and < and the like) and valid XML in a single file.

EDIT:

I suppose another option would be to combine MXML and ActionScript the way you combine HTML and JavaScript, that is, to wrap the script in an XML comment inside the script block. The choice to use CDATA instead was made by the developers of the MXML compiler. I suppose the reasoning has more to do with editing: the MXML editor validates your code against a schema to check syntax and provide context help, as well as parsing your ActionScript code for syntax and context help. Using CDATA allows the editor to do both and to differentiate between XML comments and script blocks.

Ryan Lynch
A: 

I believe that CDATA was intended to allow raw binary data: as long as it doesn't contain "]]>" then anything goes in a CDATA section. This does set it apart from normal XML and should speed up parsing (and negate the necessity for full text encoding, thus giving a second performance boost). Actually it proved quite problematic what with people not escaping the closing sequence and several early parsers being variously broken, so most now just use a text encoding for binary data, making the CDATA section somewhat pointless, yes.

EDIT: note that this answer is in fact wrong, as Tomalak identifies in comments. I've not deleted it because I know there are other people out there who think that raw binary is acceptable in CDATA and this might clear up that little misunderstanding.
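A small sketch of the corrected picture, using Python's standard library (the `blob` element and the sample bytes are made up for illustration). A character that is illegal in XML 1.0 stays illegal inside CDATA, so binary data is normally carried through a text encoding such as Base64.

```python
import base64
import xml.etree.ElementTree as ET

# Arbitrary bytes, including 0x00, 0x01 and the byte values of "]]>".
raw = bytes([0x00, 0x01, 0xFF, 0x5D, 0x5D, 0x3E])

# U+0001 is not a legal XML 1.0 character; CDATA does not change that.
try:
    ET.fromstring("<blob><![CDATA[\x01]]></blob>")
    rejected = False
except ET.ParseError:
    rejected = True
print("control character rejected:", rejected)

# Base64 output uses only characters that are safe as ordinary XML text.
encoded = base64.b64encode(raw).decode("ascii")
doc = f"<blob>{encoded}</blob>"
decoded = base64.b64decode(ET.fromstring(doc).text)
print(decoded == raw)
```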

sinibar
But CDATA means character data, I doubt that you can put in raw byte sequences that are otherwise illegal in XML.
Tomalak
Oh yes you can! The binary data tends to break other things in the chain though! The main reason for still using CDATA is to preserve formatting of text, as in newlines and tabs and sequences of spaces, which get lost when parsing normal tags.
James Anderson
Also, it's a mere 134,217,728 to one chance that ]]> will appear somewhere in your binary data!
James Anderson
The spec (http://www.w3.org/TR/REC-xml/#sec-cdata-sect) says CData can contain characters (http://www.w3.org/TR/REC-xml/#charsets). Sorry, but to me that does not look like binary was allowed. Maybe there is some odd XML parser that allows it, but it surely is not the way it was meant to be.
Tomalak
I checked and yes, Tomalak is correct (or rather that's my reading of the Fifth Edition of the XML 1.0 spec). Either my original understanding predates XML 1.0 (entirely likely) or I was misinformed (equally likely).
sinibar
@sinibar: I suggest you make that answer a community wiki (you can do so in edit mode). Some people down-vote "wrong" answers regardless of whether you have pointed out the mistake or not. In wiki mode, this won't cause you any rep loss, at least.
Tomalak
+2  A: 

When in doubt, check the spec:

2.7 CDATA Sections

[Definition: CDATA sections may occur anywhere character data may occur; they are used to escape blocks of text containing characters which would otherwise be recognized as markup.]

NickFitz
@NickFitz: I'm aware of the basic facts. ;-) I was asking what the *benefit* of CDATA over XML-escaping would be.
Tomalak
And the spec tells you: they are used for escaping blocks of text containing characters which would otherwise be recognised as markup. The corollary of this is that they can be used when for some reason it is impractical, impossible, or undesirable to escape markup characters using entities. Therefore the benefit is that CDATA sections provide an alternative to escaping. Devising actual use-cases is left as an exercise for the reader ;-)
NickFitz
You *are* the reader, this is your exercise as posed by Tomalak. :-P
Andrzej Doyle
@dtsazza: *chuckles*
Tomalak
A: 

CDATA sections are really useful when you want to define a schema for some XML but part of it is out of your control and you can't ensure that it will meet the schema or won't break the XML.

I often have to work with legacy systems whose HTML output is often not well-formed XHTML. I can attach a schema that ensures that the XML is structured correctly, but have a tag that just contains a CDATA section for housing the potentially bad HTML.

It's not a common usage but it definitely has its uses when you don't want other people's lax programming to break your system.

colethecoder
But you could just use the HTML outputs as the node value and they would work equally well, only that they appear as XML-escaped.
Tomalak
Yes, but that incurs the performance cost of converting to escaped HTML and then back out again. That is probably minor in a lot of use cases, but within a transport mechanism, particularly one under high load, it is potentially significant. Also, as I highlighted, when working with legacy systems it can be dangerous to assume that they can escape the characters at all, let alone that they will do so consistently.
colethecoder
+1  A: 

I don't know how helpful this will be, but I'll throw this in too:

One of the issues is that there are a couple of distinct camps of XML developers, where some view XML as a representation of data, and some view it in a more document-centric way. (The beauty of XML is that it works well for both.)

Those who view XML as a representation of data--where the XML is often being produced and consumed by tools, and humans only get involved for troubleshooting--will see little value in a CDATA section, because it doesn't make a difference to their tools, whereas those who use XML for more document-centric purposes may find CDATA sections much more useful.

sernaferna
+2  A: 
Madhan