We call an API that returns a string of XML-formatted data. We'd like to convert this string into a ColdFusion XML object via XMLParse(). A problem occurs when HTML character entities show up in the data values. For example, entities like these:

&nbsp; &mdash; &ndash;

(Yes, the raw data contains them in their HTML-encoded form.) XMLParse() throws an error on these entities. Here is an example that will error:

Part of our string: <event>Hello &nbsp; World</event>

Error: Reference to undefined entity "&nbsp;"

What's the best method to make these characters compatible with the XMLParse()? And even more important - how can we do this if we don't always know what the characters will be?

Thanks!

(this is on a ColdFusion 6 server)

+1  A: 
replace(xml, '&','&amp;','all');

should allow it to be parsed. You can also use a DTD to define these characters, but as you stated you don't always know what the characters will be, so I would probably just do the replace.
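
For reference, here is a minimal sketch of the DTD route: prepend an internal DOCTYPE that declares the entities you expect. It assumes the root element is `event`, that the payload carries no XML declaration or DOCTYPE of its own, and that the server's parser accepts an internal DTD subset. `rawXml` and `doctype` are illustrative names, and this has not been verified on ColdFusion 6.

```cfml
<!--- Assumption: rawXml is the API response and its root element is <event> --->
<cfset rawXml = "<event>Hello &nbsp; World</event>">

<!--- Declare each expected HTML entity as a numeric character reference.
      ## is how a literal # is written inside a CFML string. --->
<cfset doctype = '<!DOCTYPE event [<!ENTITY nbsp "&##160;"><!ENTITY mdash "&##8212;"><!ENTITY ndash "&##8211;">]>'>

<!--- The parser can now resolve &nbsp; because the DOCTYPE defines it --->
<cfset doc = XmlParse(doctype & rawXml)>
```

This only covers the entities you declare, so it shares the limitation noted above: an entity you did not anticipate will still raise the "undefined entity" error.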

Nick
A: 

You might take a look at XmlFormat(). Easy to use:

<cfset string = XmlFormat(string)>
pb
How is that going to help? The whole string will end up double-encoded, which makes it even less fit for XML parsing than before.
Tomalak
Tomalak is correct...this causes more of a problem because XMLParse() will get hung up on these new chars.
Alex
yikes, yeah, this will escape the entities too so it's no longer XML. I'll read a little closer next time.
pb
+2  A: 

I would recommend:

ReplaceList(xml, "&nbsp;,&mdash;,&ndash;", "#Chr(160)#,#Chr(8212)#,#Chr(8211)#")

Wikipedia seems to have a quite complete list of character entities and their char codes. I would opt for using Chr() to create the replacement string, this way you can be unambiguous and independent of source-code file encoding.
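
A variation on the same idea, sketched under assumptions: instead of inserting the characters themselves, map each named entity to its numeric character reference, which an XML parser accepts without any DTD. The entity list below is illustrative rather than complete, and `xml` is assumed to hold the raw API response.

```cfml
<!--- Assumption: xml holds the raw API response --->
<cfset xml = "<event>Hello &nbsp; World &mdash; again</event>">

<!--- Swap named HTML entities for numeric references.
      ## is how a literal # is written inside a CFML string. --->
<cfset xml = ReplaceList(xml, "&nbsp;,&mdash;,&ndash;,&copy;", "&##160;,&##8212;,&##8211;,&##169;")>

<!--- Numeric character references are valid in any XML document --->
<cfset doc = XmlParse(xml)>
```

Like the Chr() version, this keeps the source file pure ASCII, so it does not depend on the file's encoding.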

Tomalak
The only drawback: if you need to know that something was an entity vs. a specific character, this method will obscure that info.
pb
An entity and a specific character are *the same thing*, just serialized differently. There is no need to make such a distinction unless you're doing something wrong. From the DOM's point of view, a `©` is a `©`, no matter if you used `&copy;`, `&#169;`, `&#xA9;` or a literal `©` to express it in the HTML source.
Tomalak
+1  A: 

This seems to be a pretty good function to remove extended characters and replace them with their HTML equivalent.

http://www.petefreitag.com/item/202.cfm

Jason
+1  A: 

See this related question: http://stackoverflow.com/questions/1646839/decode-numeric-html-entities-in-coldfusion

Use that, and then XmlFormat() it, then XmlParse() it.

"nbsp is not one of the 5 predefined character entity references", @stevenerat said.

Henry
+1  A: 

Yup, nbsp is not one of the predefined character entity references and needs to be escaped with XmlFormat(), such as XmlParse(XmlFormat(theString)).

http://en.wikipedia.org/wiki/List%5Fof%5FXML%5Fand%5FHTML%5Fcharacter%5Fentity%5Freferences
http://livedocs.adobe.com/coldfusion/7/htmldocs/00000668.htm

Steven Erat
Using `XmlFormat()` on `theString` would change `&nbsp;` to `&amp;nbsp;`. You *could* possibly parse the string as XML after that, but the data the string contained would be changed. Apart from that, `XmlFormat()` would change `<` to `&lt;` etc., so you would turn all the markup into text. Not what the OP intends, I think.
Tomalak
Alex
A: 

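Replacing the "&" with "&amp;" and then back again after parsing seems to work:

<!--- Sample payload with an HTML entity (&nbsp;) and an XML entity (&amp;) --->
<cfsavecontent variable="xmlString">
    <event>Hello&nbsp;World&amp;</event>
</cfsavecontent>
<!--- Escape every ampersand so the parser treats the entities as plain text --->
<cfset xmlString = Replace(xmlString, "&", "&amp;", "all") />
<cfset doc = XmlParse(xmlString) />
<!--- Undo the escaping on the extracted text --->
<cfset value = Replace(doc.event.xmlText, "&amp;", "&", "all") />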
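
A more selective variant, offered as a sketch rather than a tested solution: escape only the ampersands that do not start one of the five predefined XML entities or a numeric character reference. That way `&amp;`, `&lt;` and `&#160;` pass through unchanged and nothing has to be undone for them afterwards. It assumes the server's regular expression engine supports negative lookahead (`(?!...)`). `safeXml` is an illustrative name.

```cfml
<cfset xmlString = "<event>Hello&nbsp;World&amp;</event>">

<!--- Escape & unless it begins &amp; &lt; &gt; &quot; &apos; or a numeric reference.
      ## is how a literal # is written inside a CFML string. --->
<cfset safeXml = REReplace(xmlString, "&(?!(amp|lt|gt|quot|apos);|##[0-9]+;|##x[0-9A-Fa-f]+;)", "&amp;", "all")>

<!--- safeXml is now <event>Hello&amp;nbsp;World&amp;</event> --->
<cfset doc = XmlParse(safeXml)>

<!--- The unknown entity survives as literal text in the parsed value --->
<cfset value = doc.event.XmlText>
```

The parsed text still contains the literal string `&nbsp;`, so it needs decoding later if the actual character is wanted. The benefit is that parsing no longer fails on entities nobody anticipated.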
Lucas Moellers