views:

127

answers:

4

Anyone know what the best practices are or have general advice around having HTML/XHTML content within an XML element? Is it best to use CDATA or to just HTML encode the HTML?

+4  A: 

I would recommend CDATA; it will make the XML smaller and more easily human-readable.

However, make sure to escape `]]>` as `]]]]><![CDATA[>`, which closes the CDATA section after the two brackets and reopens a new one before the `>`.
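As a minimal sketch of that escaping rule, the helper below wraps a string in CDATA and splits any literal `]]>` across two sections. The name `to_cdata` and the sample script are illustrative assumptions, not something from the question; Python's standard `xml.etree.ElementTree` is used only to check that the text survives a round trip.

```python
import xml.etree.ElementTree as ET


def to_cdata(text: str) -> str:
    """Wrap text in a CDATA section, splitting any "]]>" across two sections.

    Hypothetical helper for illustration only.
    """
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


# Sample HTML containing a literal "]]>" inside a script (made up for the demo).
html = '<script>if (a[b[0]]>1) { alert("hi"); }</script>'
doc = "<content>" + to_cdata(html) + "</content>"

# The parser joins the adjacent CDATA sections back into the original string.
parsed = ET.fromstring(doc).text
print(parsed == html)
```

A consumer never sees the split; adjacent CDATA sections are reported as one run of character data.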

EDIT: As other people have said, if you control the HTML that you're embedding, and you know that it will always be valid XHTML, then you should nest it directly without escaping.

However, if you don't control the HTML, I might not recommend that. Even if it's valid now, it might one day become invalid, and you do not want your system to suddenly break because of that. Obviously, this depends on the circumstances and the use case; if you want a more precise recommendation, please give us more detail.

SLaks
It’s allowed to escape `]]>` inside a CDATA block with `]]&gt;`.
Gumbo
That's an HTML escape. If, for some unimaginable reason, he needs the literal string `]]>` in his CDATA (perhaps a JavaScript string literal), he'll need to do it my way. Since he's generating it in code, it's much better not to assume that an HTML escape is suitable.
SLaks
“The right angle bracket (>) may be represented using the string " `&gt;` ", and must, for compatibility, be escaped using either " `&gt;` " or a character reference when it appears in the string " `]]>` " in content, when that string is not marking the end of a CDATA section.” http://www.w3.org/TR/xml/#dt-chardata
Gumbo
CDATA is employed more often than it should be. http://www.xml.com/pub/a/2003/08/20/embedded.html
Mads Hansen
@Gumbo: That's only in regular content. In a CDATA section, no ampersand escapes are recognized at all. http://www.w3.org/TR/xml/#dt-cdsection (I tested this in .NET's XML parser)
SLaks
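The difference between regular content and a CDATA section can be checked quickly. This is a small sketch using Python's `xml.etree.ElementTree` rather than the .NET parser mentioned in the comment; the element name `c` and the sample strings are made up for the demonstration.

```python
import xml.etree.ElementTree as ET

# Inside CDATA the characters "&gt;" are kept literally; nothing is unescaped.
in_cdata = ET.fromstring("<c><![CDATA[a ]]&gt; b]]></c>").text

# In regular character data the same reference is decoded to ">".
in_content = ET.fromstring("<c>a ]]&gt; b</c>").text

print(repr(in_cdata))
print(repr(in_content))
```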
@Mads Hansen: I fully agree. However, a large amount of HTML content out there is invalid XHTML. Obviously, this depends on what the HTML is, but it's quite likely that he's getting HTML from something outside his control, in which case it's probably invalid. Even if it is valid right now, it might one day become invalid, and he probably doesn't want his system to break because of it. However, if he does control the HTML, he should nest it directly, as other people have said.
SLaks
+2  A: 

Third option: Having HTML normally embedded in XML is far more flexible than encoding it, or embedding it with CDATA. It allows parsers to handle the entire document including the HTML in a high-level way. It allows use of XSL transformations on both the containing XML and the HTML data.

So I'd suggest directly embedding it unless your HTML is not valid XML, in which case encoding or CDATA would be the only option anyway.
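To illustrate what handling the HTML "in a high-level way" can look like, here is a rough sketch that pulls every link out of directly embedded XHTML, and shows that the same query finds nothing once the markup is wrapped in CDATA. The `item`/`body` element names and the URLs are invented for the example; only the XHTML namespace URI is real.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"

# Hypothetical container document with XHTML nested directly inside it.
doc = f"""<item>
  <title>Example</title>
  <body>
    <div xmlns="{XHTML}"><p>See <a href="http://example.com/">this</a>
    and <a href="http://example.org/">that</a>.</p></div>
  </body>
</item>"""

root = ET.fromstring(doc)

# The embedded markup is part of the tree, so it can be queried like any XML.
links = [a.get("href") for a in root.iter(f"{{{XHTML}}}a")]
print(links)

# With CDATA the same markup is just a string: the element has no children.
as_string = ET.fromstring("<body><![CDATA[<p><a href='x'>t</a></p>]]></body>")
print(len(as_string))
```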

Joren
A: 

I'd go with namespaced XHTML directly in the document (as opposed to "as a string", which is what the two options you propose offer).

If you don't do that, then it makes no difference which of the two you use.

David Dorward
I'm using this in a sub-element of the <content> portion of an Atom Publishing Protocol XML document. Would this work in this case? I don't see why it wouldn't.
Kevin M
And the HTML needs to remain intact when retrieving it out of the XML. Using normal XML processing, the HTML tags would be ignored and just the text within tags in the HTML would be accessible, right? That's not the desired effect. I need the HTML intact for use in creating an HTML page for viewing in a browser later.
Kevin M
I generate Atom with namespaced XHTML in it and have no troubles. "Throwing out the tags" is not "normal XML processing". Atom explicitly supports this approach: http://atompub.org/rfc4287.html#rfc.section.4.1.3.4
David Dorward
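For what it's worth, here is a sketch of getting the markup back out intact from an Atom entry that uses `type="xhtml"` content. The entry itself is invented for illustration; the two namespace URIs are the standard Atom and XHTML ones, and Python's `xml.etree.ElementTree` stands in for whatever XML library is actually in use.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
XHTML = "http://www.w3.org/1999/xhtml"

# Hypothetical Atom entry with namespaced XHTML nested in <content>.
entry = f"""<entry xmlns="{ATOM}">
  <title>Example entry</title>
  <content type="xhtml">
    <div xmlns="{XHTML}"><p>Hello, <em>world</em>!</p></div>
  </content>
</entry>"""

root = ET.fromstring(entry)
div = root.find(f"{{{ATOM}}}content/{{{XHTML}}}div")

# Serialize with XHTML as the default namespace so tags come out unprefixed.
# Note: register_namespace changes a process-wide mapping.
ET.register_namespace("", XHTML)

# tostring() also emits the whitespace that follows the element, so strip it.
markup = ET.tostring(div, encoding="unicode").strip()
print(markup)
```

The tags are preserved because the subtree is serialized as XML; reading only the text content would be a separate, explicit choice.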
+1  A: 

Since a lot of HTML is not well-formed XML (e.g., missing end tags like </p> and </li>, or <br> written without the self-closing slash), it may be less work to simply use a CDATA wrapper.

It depends on where you're getting the HTML from. If you're generating it yourself, you have total control over its form, but if you're pulling it from some other source (e.g., extracting it from some other web site) you probably don't have the luxury of reformatting it to be XHTML compliant.
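One way to hedge between the two situations is to try parsing the HTML as XML and fall back to storing it as text when that fails. The sketch below assumes a hypothetical helper `embed_html` and a wrapping `<div>`; note that `xml.etree.ElementTree` writes the fallback as entity-escaped text rather than CDATA, which reads back as the same string.

```python
import xml.etree.ElementTree as ET


def embed_html(parent: ET.Element, html: str) -> bool:
    """Nest html under parent if it is well-formed XML, else store it as text.

    Hypothetical helper for illustration. Returns True when the markup was
    nested directly, False when it was stored as an (escaped) string.
    """
    try:
        # Wrap in a div so fragments with several top-level nodes still parse.
        fragment = ET.fromstring(f"<div>{html}</div>")
    except ET.ParseError:
        parent.text = html  # escaped automatically on serialization
        return False
    parent.append(fragment)
    return True


good = ET.Element("content")
nested = embed_html(good, "<p>closed</p>")

bad_html = "<p>unclosed<br>"
bad = ET.Element("content")
fell_back = not embed_html(bad, bad_html)

serialized = ET.tostring(bad, encoding="unicode")
print(nested, fell_back, serialized)
```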

Loadmaster