I have the following block of HTML:

<p>The quick brown fox jumps over the lazy dog &mdash; The quick brown fox jumps over the lazy dog.</p>
<p>The quick brown fox jumps over the lazy dog &mdash; The quick brown fox jumps over the lazy dog.
<br>The quick brown fox jumps over the lazy dog &mdash; The quick brown fox jumps over the lazy dog.

It is NOT valid XHTML. However, I need to include this HTML in an XML document. I tried using XMLFormat() in order to convert the < to &lt; and the > to &gt;, which works great. Unfortunately, it also converts &mdash; to &amp;mdash;, which is not valid and throws an exception in the CFXML tag.

<cfxml variable="myXML">
    <content>#XMLFormat(myHTML)#</content>
</cfxml>

How can I work around this?

+1  A: 

It's tough when you have some HTML partially converted, and then need to do the rest...

You could replace all the "&" signs temporarily, run the XMLFormat, then convert the "&" signs back.

<cfscript>
// replace & signs with a temp placeholder
myHTML = replace(myHTML, "&", "*amp*", "all");

// format for XML
myHTML = XMLFormat(myHTML);

// replace placeholders with & signs
myHTML = replace(myHTML, "*amp*", "&", "all");
</cfscript>

If it works, you could make this one step by wrapping this logic in a single function.
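A minimal sketch of that single function. The name xmlFormatKeepAmpersands is made up for illustration (it is not a built-in), and the sketch assumes the placeholder text *amp* never appears in the real content.

```cfml
<cfscript>
// Hypothetical helper, not a built-in ColdFusion function.
// Assumes the string "*amp*" never occurs in the real content.
function xmlFormatKeepAmpersands(html) {
    var placeholder = "*amp*";
    // hide every & so XMLFormat() leaves existing entities alone
    var result = replace(html, "&", placeholder, "all");
    // escape <, > and quotes for XML
    result = XMLFormat(result);
    // put the original & signs back
    return replace(result, placeholder, "&", "all");
}

safeHTML = xmlFormatKeepAmpersands(myHTML);
</cfscript>
```

One thing to watch: the result still contains references such as &mdash;, and XML itself only predefines lt, gt, amp, apos and quot, so a parser will accept the others only if they are declared somewhere. Any bare & in the content is also put back unescaped.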

Dan Sorensen
+1  A: 

How about simply not using the &mdash; escape in the source string and instead including the — character in situ?

Edit:

I'm gonna guess that the HTML content stored in the database is not known to be XHTML compliant and hence to put it in an XML document you have no choice but to either place it in a CDATA section or encode it correctly. There is an assumption that placing it in an XML document like this is useful and that it can be properly decoded at the consuming end. This will be true of either approach if a typical XML DOM is used at the consumer.

So this leads me to this question: what's actually wrong with &amp;mdash;? After all, < will result in &lt; and so on. When retrieved from a DOM by the consumer, the resulting string will be back to using &mdash;, < and so on, and when subsequently used as HTML all will be well.
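A quick way to check this round trip, as a sketch only. The sample string and the variable names doc and roundTripped are made up for illustration.

```cfml
<cfscript>
// sample content containing both markup and an HTML entity
myHTML = "<p>fox &mdash; dog</p>";

// escape everything, including the & of &mdash;, and parse the result
doc = xmlParse("<content>" & XMLFormat(myHTML) & "</content>");

// the DOM hands the text back unescaped
roundTripped = doc.XmlRoot.XmlText;

// display it literally so the browser does not render the markup
writeOutput(HTMLEditFormat(roundTripped));
</cfscript>
```

If the consumer reads the element text through a DOM like this, roundTripped should match the original myHTML, &mdash; included.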

AnthonyWJones
This is existing content for a client which I am not at liberty to edit.
Eric Belair
+4  A: 

You have a few options. A lot depends on how this content is going to be used. It would be extremely helpful to include a desired output document, as well as indicate where this xml is being used.

If you don't want to mess with the content of the HTML at all, you could always use CDATA, like this:

<cfxml variable="myXML">
    <content><![CDATA[#myHTML#]]></content>
</cfxml>

Also, I know you say you don't want to convert the remaining ampersands but I just don't see how this is so. Either the HTML content is a string you want to process -- in which case, all of it should be escaped so that it can be unescaped later -- or it's valid XML that you want to be part of the document. I mean, when you process the contents of the <content> tag later on, you will run into problems if the ampersands aren't escaped.

Jordan Reiter
I am getting the content out of a SQL Server database and putting it in an XML document so that it can be imported (along with a lot of other meta data) into a CMS. CDATA is not an option....
Eric Belair
@Eric: Why is CDATA not an option?
AnthonyWJones
What kind of CMS? Basically none of this makes sense. If you're importing the text, then all of it must be escaped, including the &mdash;. &amp;mdash; is totally valid and should not throw an exception in the CFXML tag. You are probably doing something wrong.
Jordan Reiter
@Jordan, I believe it's Interwoven. @Anthony, I'm not sure why CDATA is not an option, but I think the CMS import script - out of my control - is not set up to handle it.
Eric Belair
Okay, so Interwoven is going to import all of the text between the <content></content> tags? Is it then going to unescape it into HTML? If so, then yes, you HAVE to XMLFormat everything.
Jordan Reiter
A: 

For the time being, I'm simply going to replace all less-than and greater-than characters with "&lt;" and "&gt;", respectively.
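That amounts to something like the following sketch (the variable name escaped is just for illustration):

```cfml
<cfscript>
// escape only the angle brackets; ampersands are left untouched
escaped = replace(myHTML, "<", "&lt;", "all");
escaped = replace(escaped, ">", "&gt;", "all");
</cfscript>
```

Since the ampersands are not touched, references like &mdash; and any bare & pass through as-is, so this only helps if whatever reads the XML tolerates them.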

Eric Belair
A: 

In this specific use case, you can use URLEncodedFormat() to preserve the natural form of the content, and then use URLDecode() on the way out.

<cfxml variable="content">
    <content><cfoutput>#URLEncodedFormat(myHTML)#</cfoutput></content>
</cfxml>
<cfset xml = xmlParse(content)>
<cfoutput>#URLDecode(xml.content.xmltext)#</cfoutput>

I'm not recommending this as a best practice, only that it would work in the scenario posed by the question.

jalpino