I am using XMLFormat() to encode some text for an XML document. However, when I read back the XML file I created, I get an invalid character error. Why does XMLFormat() not properly encode all characters?

I'm running CF8.

+3  A: 

Are you sure you are writing the file in the right encoding? You can't just do

<cffile action="write" file="foo.xml" output="#xml#" />

as the result very likely diverges from the character set your XML is in. Unless otherwise noted (by an encoding declaration), XML files are treated as UTF-8, and you should do:

<cffile action="write" file="foo.xml" output="#xml#" charset="utf-8" />
<!--- and --->
<cffile action="read" file="foo.xml" variable="xml" charset="utf-8" />
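
A minimal round-trip sketch of the above, assuming a hypothetical file path and sample text (neither is from the question): the encoding declaration, the charset used for writing, and the charset used for reading all say UTF-8, so XmlParse() should see the same characters that were written.

```cfml
<!--- Hypothetical sample text with one non-ASCII character (e-acute, code 233) --->
<cfset sampleText = "caf" & Chr(233) & " & more">
<!--- cffile wants an absolute path, so resolve it relative to this template --->
<cfset xmlFile = ExpandPath("foo.xml")>
<!--- The declaration names the same encoding that is used for the write below --->
<cfset xml = '<?xml version="1.0" encoding="UTF-8"?><note>#XmlFormat(sampleText)#</note>'>

<cffile action="write" file="#xmlFile#" output="#xml#" charset="utf-8" />
<cffile action="read" file="#xmlFile#" variable="xmlIn" charset="utf-8" />

<!--- Parsing fails here if the bytes on disk and the declared encoding disagree --->
<cfset doc = XmlParse(xmlIn)>
<cfoutput>#XmlFormat(doc.xmlRoot.xmlText)#</cfoutput>
```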
Tomalak
I'm trying to use cfheader and cfcontent to serve the xml document as an actual xml document.
Jason
So there is no save to/load from disk step involved on the server side? If that's the case, how is the file served (check with HeaderSpy, for example)? Do the encoding declared in the file and the encoding it is served with match?
Tomalak
Also, have you considered DOM functions (`XmlNew()` et al.) to build the file, instead of string concatenation and `XmlFormat()`?
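
A rough sketch of what that could look like, assuming a made-up `notes`/`note` structure and sample text: the text is assigned to a node as-is and the serializer takes care of escaping markup characters, so there is no `XmlFormat()` call. Characters that are not allowed in XML at all (most control characters) may still need to be removed beforehand.

```cfml
<cfscript>
    // Hypothetical user input containing markup characters
    userText = "Fish & chips <tasty>";

    // Build the document with the DOM functions instead of concatenating strings
    doc = XmlNew();
    doc.xmlRoot = XmlElemNew(doc, "notes");

    note = XmlElemNew(doc, "note");
    note.xmlText = userText; // raw text, escaping happens on serialization
    ArrayAppend(doc.xmlRoot.xmlChildren, note);

    // Serialize the document object to an XML string
    xml = ToString(doc);
</cfscript>
```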
Tomalak
A: 

Do not forget also to put <cfprocessingdirective pageencoding="utf-8"> on top of your template.

rparente
This is probably useless, as it only describes the encoding the CFML source file itself is in. Most of the time, there is no need to set `pageencoding`.
Tomalak
+1  A: 

I feel that this is a bug in XMLFormat. I am not sure who the original author of the snippet below is but here is an approach to catch the extra characters via regex...

  <cfset myText = xmlFormat(myText)>

  <cfscript>
      // replace every non-ASCII chr left over by xmlFormat() with a hex numeric character reference.
      i = ReFind('[^\x00-\x7F]',myText,1,false); // position of the first high chr, 0 if there is none.
      while(i GT 0)
      {
        tmp = '&##x#FormatBaseN(Asc(Mid(myText,i,1)),16)#;'; // obtain the high chr and convert it to a hex numeric chr.
        myText = Insert(tmp,myText,i); // insert the new hex numeric chr right after the high chr.
        myText = RemoveChars(myText,i,1); // delete the redundant high chr from the string.
        i = ReFind('[^\x00-\x7F]',myText,i+Len(tmp),false); // continue the scan after the inserted reference.
      }
      // myText now holds the encoded text; if this code lives inside a function, return myText here.
  </cfscript>
kevink
A: 

If you're trying to return your XML directly to the browser for the user to download, you might want to try something like

<cfheader name="Content-Disposition" charset="utf-8" value="attachment; filename=export.xml">
<cfcontent variable="#someXMLPacket#" type="text/xml"  reset="true">

Or, if you want it returned as a web page (à la REST), then this should do the trick:

<cfcontent variable="#someXMLPacket#" type="text/xml; charset=utf-8" reset="true">
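
One caveat, as far as I know: the `variable` attribute of cfcontent expects binary data, so a plain string may be rejected. A possible sketch, assuming `someXMLPacket` holds the XML as a string (the sample content here is made up), converts it to UTF-8 bytes first and names the charset in the content type:

```cfml
<!--- Assumption: someXMLPacket is a plain string, not binary --->
<cfset someXMLPacket = '<?xml version="1.0" encoding="UTF-8"?><note>caf#Chr(233)#</note>'>

<!--- Convert the string to a UTF-8 byte array --->
<cfset xmlBytes = CharsetDecode(someXMLPacket, "utf-8")>

<!--- Serve the bytes; the charset in the type matches the XML declaration --->
<cfcontent variable="#xmlBytes#" type="text/xml; charset=utf-8" reset="true">
```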

hope that helps

LucasS
A: 

Unfortunately, XMLFormat is just not an all-inclusive solution. It has a very limited list of characters that it will replace [documentation].

You'll need to do custom encoding of characters that are invalid for XML but not covered by XMLFormat.

It's definitely not very efficient, but a potential solution would be to loop over the content of typically suspect fields (anything user-generated, for starters) character by character, checking the ASCII code, and if it's above 255, either omit the character or properly encode it.
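
A minimal sketch of such a loop, with a different cut-off than the one above: it only drops control characters below 32 other than tab, line feed and carriage return, since XML 1.0 does not allow those, and keeps everything else. The function name and the sample text are made up for illustration.

```cfml
<cfscript>
// Hypothetical helper, not a built-in function
function stripXmlControlChars(text) {
    var result = "";
    var ch = "";
    var code = 0;
    var i = 0;
    for (i = 1; i LTE Len(text); i = i + 1) {
        ch = Mid(text, i, 1);
        code = Asc(ch);
        // keep tab (9), line feed (10), carriage return (13) and everything from space (32) up
        if (code GTE 32 OR code EQ 9 OR code EQ 10 OR code EQ 13) {
            result = result & ch;
        }
    }
    return result;
}

// Made-up sample: Chr(11) is a vertical tab, which XML 1.0 does not allow
myText = "bad" & Chr(11) & "text & more";
safeText = XmlFormat(stripXmlControlChars(myText));
</cfscript>
```

Stripping first and calling XmlFormat() afterwards keeps the two concerns separate: the loop removes what XML cannot represent, and XmlFormat() escapes what it can.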

Adam Tuttle
First, non-ASCII characters aren't the issue per se, since XML was designed with Unicode in mind and is assumed to be UTF-8 text unless otherwise noted. Second, the sneaky Windows characters that tend to produce the most trouble are below 255; the troublesome quotation marks, in particular, are 145-148.
Sixten Otto
A: 

This was a huge issue for me as well, and it turns out charset is the main factor: you need to clearly specify the correct charset.

In my case, I had foreign-language text inside the XML, and it wouldn't parse correctly until I put in the correct charset.

crosenblum