I am using XMLFormat() to encode some text for an XML document. However, when I go to read the XML file I created I get an invalid character error. Why does XMLFormat() not properly encode all characters?
I'm running CF8.
Are you sure you are writing the file in the right encoding? You can't just do
<cffile action="write" file="foo.xml" output="#xml#" />
because the file will very likely be written in a character set other than the one your XML is supposed to be in. Unless an encoding declaration says otherwise, XML files are treated as UTF-8, so you should do:
<cffile action="write" file="foo.xml" output="#xml#" charset="utf-8" />
<!--- and --->
<cffile action="read" file="foo.xml" variable="xml" charset="utf-8" />
Also remember to put <cfprocessingdirective pageencoding="utf-8"> at the top of your template.
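Putting those pieces together, a round trip might look roughly like the sketch below. It assumes myText already holds the text to encode; the notes/note element names and the xmlPath and xmlIn variables are made up for illustration. The encoding declaration only states explicitly what the charset attribute does when the file is written.

```cfml
<cfprocessingdirective pageencoding="utf-8">

<!--- xmlPath is a hypothetical location; ExpandPath() resolves it relative to this template --->
<cfset xmlPath = ExpandPath("foo.xml")>

<!--- Build the document; the declaration must be the very first thing in the file --->
<cfsavecontent variable="xml"><?xml version="1.0" encoding="UTF-8"?>
<notes><note><cfoutput>#XMLFormat(myText)#</cfoutput></note></notes></cfsavecontent>

<!--- Write and read with the same charset that the declaration names --->
<cffile action="write" file="#xmlPath#" output="#Trim(xml)#" charset="utf-8">
<cffile action="read" file="#xmlPath#" variable="xmlIn" charset="utf-8">

<!--- Parsing fails here if the bytes on disk do not match the declared encoding --->
<cfset doc = XmlParse(xmlIn)>
```

If the parse still fails with the charsets aligned, the text probably contains characters that XML itself does not allow, which is a separate problem from encoding.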
I feel that this is a bug in XMLFormat. I am not sure who originally wrote the snippet below, but it is one approach to catching the remaining non-ASCII characters with a regex and turning them into numeric character references:
<cfset myText = xmlFormat(myText)>
<cfscript>
i = 1; // ReFind start positions are 1-based.
tmp = '';
while(ReFind('[^\x00-\x7F]',myText,i,false))
{
    i = ReFind('[^\x00-\x7F]',myText,i,false); // find the next non-ASCII character and save its position in the string.
    tmp = '&##x#FormatBaseN(Asc(Mid(myText,i,1)),16)#;'; // build the hexadecimal character reference for that character.
    myText = Insert(tmp,myText,i); // insert the reference right after the character.
    myText = RemoveChars(myText,i,1); // remove the original character, leaving only the reference.
    i = i+Len(tmp); // resume scanning just past the inserted reference.
}
return myText; // assumes this code sits inside a function body; drop this line if you run it inline.
</cfscript>
If you're trying to return your XML directly to the browser, you might want to try something like this so that the user can download it:
<cfheader name="Content-Disposition" charset="utf-8" value="attachment; filename=export.xml">
<cfcontent variable="#someXMLPacket#" type="text/xml" reset="true">
Or, if you want it returned as a web page (à la REST), then this should do the trick:
<!--- cfheader needs a name or statuscode, so the charset goes into the content type instead --->
<cfcontent variable="#someXMLPacket#" type="text/xml; charset=utf-8" reset="true">
Hope that helps.
Unfortunately, XMLFormat is just not an all-inclusive solution. It has a very limited list of characters that it will replace (see the documentation).
You'll need to do custom encoding of characters that are invalid for XML but not covered by XMLFormat.
It's definitely not very efficient, but one potential solution is to loop over the content of typically suspect fields (anything user-generated, for starters) character by character, check each character's code, and, if it's above 255, either omit the character or encode it properly.
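A minimal sketch of that character-by-character loop follows. The function name xmlSafe is hypothetical, not a built-in. The 255 threshold comes from the suggestion above. Dropping control characters below 32 (other than tab, line feed, and carriage return) is an added assumption, based on the fact that XML 1.0 does not allow them. The sketch works on one UTF-16 code unit at a time, so characters outside the Basic Multilingual Plane would need extra handling.

```cfml
<cfscript>
// Hypothetical helper: returns a copy of str that should be safe to place in XML text.
function xmlSafe(str) {
    var out = "";
    var i = 0;
    var ch = "";
    var code = 0;
    for (i = 1; i lte Len(str); i = i + 1) {
        ch = Mid(str, i, 1);
        code = Asc(ch);
        if (code gt 255) {
            // above the threshold: write a hexadecimal character reference
            out = out & "&##x" & FormatBaseN(code, 16) & ";";
        } else if (code gte 32 or code eq 9 or code eq 10 or code eq 13) {
            // everything else that XML 1.0 allows: let XMLFormat escape it as needed
            out = out & XMLFormat(ch);
        }
        // any remaining control character is omitted
    }
    return out;
}
</cfscript>

<cfset safeText = xmlSafe(myText)>
```

Calling XMLFormat on the whole string first and then running this helper would double-escape the ampersands, which is why the helper calls XMLFormat on each character instead.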
This was a huge issue for me as well, and it turns out the charset is the main factor: you need to specify the correct charset explicitly.
In my case the XML contained text in foreign languages, and it wouldn't parse correctly until I put in the correct charset.