I have searched stackoverflow on this problem and did find a few topics, but I feel like there isn't really a solid answer for me on this.

I have a form that users submit, and the field's value is stored in an XML file. The XML is set to be encoded with UTF-8.

Every now and then a user will copy/paste text from somewhere, and that's when I get the "entity not defined" error.

I realize XML only supports a select few entities and anything beyond that is not recognized - hence the parser error.

From what I gather, there are a few options:

  1. I can find and replace all `&nbsp;` and swap them out with `&#160;` or an actual space.
  2. I can place the code in question within a CDATA section.
  3. I can include these entities within the XML file.
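
For reference, the parser error is easy to reproduce. Here is a minimal sketch in PHP (the helper name `parses` and the sample strings are made up for illustration) showing that a named HTML entity fails to load, while a numeric reference or one of XML's predefined entities is fine:

```php
<?php
// Collect libxml errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Hypothetical helper: true if SimpleXML can load the string.
function parses(string $xml): bool
{
    libxml_clear_errors();
    return simplexml_load_string($xml) !== false;
}

// &nbsp; is an HTML entity, not one of XML's five predefined entities.
var_dump(parses('<p>a&nbsp;b</p>'));   // bool(false)
// A numeric character reference needs no declaration.
var_dump(parses('<p>a&#160;b</p>'));   // bool(true)
// The predefined entities are &amp; &lt; &gt; &quot; &apos;
var_dump(parses('<p>a &amp; b</p>'));  // bool(true)
```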

What I'm doing with the XML file is this: the user enters content into a form, it gets stored in an XML file, and that content then gets displayed as XHTML on a Web page (parsed with SimpleXML).

Of the three options, or any other option(s) I'm not aware of, what's really the best way to deal with these entities?

Thanks, Ryan

UPDATE

I want to thank everyone for the great feedback. I actually determined what caused my entity errors. All the suggestions made me look into it more deeply!

Some textboxes were plain old textboxes, but my textareas were enhanced with TinyMCE. On closer inspection, the PHP warnings always referenced data from the TinyMCE-enhanced textareas. Later I noticed that on a PC all the characters were stripped out (because it couldn't read them), but on a Mac you could see little square boxes showing the Unicode number of that character. The reason they showed up as squares on the Mac in the first place is that I used utf8_encode to encode data that wasn't in UTF-8 to prevent other parsing errors (which is somehow also related to TinyMCE).

The solution to all this was quite simple:

I added the line `entity_encoding : "utf-8"` to my `tinyMCE.init`. Now all the characters show up the way they are supposed to.

I guess the only thing I don't understand is why the characters still show up when placed in textboxes, because nothing converts them to UTF, but with TinyMCE it was a problem.

Thanks, Ryan

+1  A: 
Tomalak
"You could HTML-parse the text and have it re-escaped with the respective numeric entities" - does that mean you can always store numeric entities over HTML text entities? -Ryan
Ryan S.
@Ryan: Yes, numeric entities are allowed in (and recognized by) both XML and HTML.
Tomalak
@Tomalak That means I would have to know all the entities by name and their numeric entity beforehand, right? Is that going to be extremely processing intensive if I add them all in there? -Ryan
Ryan S.
@Ryan: There are functions that know all the entity names, you don't have to do that manually. That's what I meant by "HTML-parse". Use an HTML parser for this kind of work.
Tomalak
@Tomalak In one of your paragraphs you suggested that you can store the actual character, so technically, before writing it to the XML file, could I just use html_entity_decode to get the character? -Ryan
Ryan S.
@Tomalak When you say to use a HTML parser, is that something that's available PHP natively, or do I need a separate "plugin"? If so, can you recommend one? -Ryan
Ryan S.
@Ryan: http://php.net/manual/en/class.domdocument.php.
Tomalak
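
A rough sketch of what the "HTML-parse and re-serialize" idea from the comments above could look like with DOMDocument, assuming the pasted text is a UTF-8 HTML fragment. The helper name `htmlFragmentToXml` and the wrapper markup are assumptions made for illustration, not something taken from the thread:

```php
<?php
// Hypothetical helper: parse an HTML fragment and re-serialize it as XML.
function htmlFragmentToXml(string $html): string
{
    $previous = libxml_use_internal_errors(true); // tolerate sloppy pasted markup
    $doc = new DOMDocument();
    // The meta tag tells the HTML parser that the input is UTF-8 (an assumption).
    $doc->loadHTML(
        '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        . '</head><body><div>' . $html . '</div></body></html>'
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Serialize only the children of the wrapper <div>, using XML rules.
    $out = '';
    foreach ($doc->getElementsByTagName('div')->item(0)->childNodes as $child) {
        $out .= $doc->saveXML($child);
    }
    return $out;
}

$xml = htmlFragmentToXml('caf&eacute;&nbsp;<b>bold</b><br>');
```

The HTML parser knows the named entities, and the XML serializer writes the resulting characters either literally or as numeric references, so the output should no longer contain names such as `&nbsp;`.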
A: 
JapanPro
Of course it isn't. Vanilla XML will not recognize `&nbsp;` no matter what encoding you set.
Tomalak
This is a bum steer. Undefined entities are still undefined regardless of encoding.
LarsH
The question is why it is undefined; most of the time it's due to encoding. It breaks the input, which then shows up as an unwanted undefined entity.
JapanPro
@JapanPro: When it says `"entity not defined"` then it is *definitely not* an encoding problem.
Tomalak
I'm not arguing here. It's not definite, but it is the cause most of the time, so checking the encoding could be a first debugging step.
JapanPro
+1 well, it was an encoding issue after all
Thariama
+1  A: 

If you want to convert all characters, this may help you (I wrote it a while back):

http://www.lautr.com/convert-all-applicable-characters-to-numeric-entities-for-use-in-xml

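// Assumes $content is UTF-8 and PHP >= 7.2 with mbstring (for mb_ord).
// Convert named entities (e.g. &nbsp;) to numeric ones (e.g. &#160;).
function _convertAlphaEntitysToNumericEntitys($entity){
    $char = html_entity_decode($entity[0], ENT_QUOTES | ENT_HTML401, 'UTF-8');
    // Unknown entity names decode to themselves; leave those untouched.
    return ($char === $entity[0]) ? $entity[0] : '&#'.mb_ord($char, 'UTF-8').';';
}
$content = preg_replace_callback('/&([\w\d]+);/i','_convertAlphaEntitysToNumericEntitys',$content);
// Convert every remaining non-ASCII character to a numeric entity.
function _convertAsciOver127toNumericEntitys($entity){
    return '&#'.mb_ord($entity[0], 'UTF-8').';';
}
$content = preg_replace_callback('/[^\x00-\x7F]/u','_convertAsciOver127toNumericEntitys',$content);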
Hannes
Well, if you apply the first call, "$content = preg_replace_callback('/&([\w\d]+);/i','_convertAlphaEntitysToNumericEntitys',$content);", every HTML entity (&nbsp; and whatnot) is changed to a numeric entity. After that, apply the second preg_replace_callback call (the one using _convertAsciOver127toNumericEntitys) and every character above 127 (which is not handled by htmlspecialchars) is converted into a numeric entity. If I've misunderstood, can you please give an example snippet of input?
Hannes
@Hannes, sorry, I misunderstood what your code did. Deleting my earlier comment.
LarsH
+1  A: 

1. I can find and replace all [&nbsp;?] and swap them out with [&#160;?] or an actual space.

This is a robust method, but it requires you to have a table of all the HTML entities (I assume the pasted input is coming from HTML) and to parse the pasted text for entity references.
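
PHP already ships such a table, so it does not have to be maintained by hand. Here is a minimal sketch, assuming the pasted input is UTF-8 and should be stored as plain text (any tags in it end up escaped, not preserved as markup). The helper name `xmlSafeText` is hypothetical:

```php
<?php
// Hypothetical helper: make pasted text safe to embed as XML character data.
function xmlSafeText(string $pasted): string
{
    // 1. Decode named and numeric HTML entities to the real characters.
    $decoded = html_entity_decode($pasted, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // 2. Re-escape only what XML itself requires (&, <, >, quotes).
    return htmlspecialchars($decoded, ENT_QUOTES | ENT_XML1, 'UTF-8');
}

echo xmlSafeText('Fish&nbsp;&amp;&nbsp;chips &copy; 2010');
```

After this, the non-breaking spaces and the copyright sign are stored as literal UTF-8 characters, and only the ampersand remains escaped.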

2. I can place the code in question within a CDATA section.

In other words disable parsing for the whole section? Then you would have to parse it some other way. Could work.
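
A small sketch of that trade-off, using DOMDocument to write the CDATA section and SimpleXML to read it back. The element name `entry` and the sample string are made up, and a payload containing `]]>` would need extra handling that is not shown here:

```php
<?php
$doc = new DOMDocument('1.0', 'UTF-8');
$entry = $doc->appendChild($doc->createElement('entry'));
// createCDATASection wraps the raw string in <![CDATA[ ... ]]>.
$entry->appendChild($doc->createCDATASection('a&nbsp;<b>bold</b>'));
$xml = $doc->saveXML();

// The file is well-formed, but the content comes back as literal text,
// so "&nbsp;" and "<b>" still have to be handled before display.
$loaded = simplexml_load_string($xml);
echo (string) $loaded;
```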

3. I can include these entities within the XML file.

You mean include the entity definitions? I think this is an easy and robust way, if you don't mind making the XML file quite a bit bigger. You could have an "included" file (find one on the web) which is an external entity, which you reference from the top of your main XML file.

One downside is that the XML parser you use has to be one that processes external entities (which not all parsers are required to do). And it must correctly resolve the (possibly relative) URL of the external entity to something accessible. This is not too bad but it may increase constraints on your processing tools.
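
To show the mechanism, here is a sketch that uses the simpler internal-subset variant (entity declarations inside the file itself) instead of an external entity file. The element name and the two entities are chosen for illustration only. Note that LIBXML_NOENT also allows other entity substitution, so it is probably best reserved for files you control:

```php
<?php
// Internal DTD subset declaring the entities the content may use.
$xml = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE entry [
  <!ENTITY nbsp "&#160;">
  <!ENTITY copy "&#169;">
]>
<entry>Fish&nbsp;&amp;&nbsp;chips &copy; 2010</entry>
XML;

$doc = new DOMDocument();
// LIBXML_NOENT asks libxml to substitute entity references while parsing.
$doc->loadXML($xml, LIBXML_NOENT);
echo $doc->documentElement->textContent;
```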

4. You could forbid non-XML in the pasted content. Among other things, this would disallow entity references that are not predefined in XML (the 5 that Tomalak mentioned) or defined in the content itself. However this may violate the requirements of the application, if users need to be able to paste HTML in there.
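
If you went this route, the check could be a well-formedness test on the server before anything is written to the file. A sketch under that assumption (the helper name `xmlFragmentErrors` and the `<root>` wrapper are hypothetical, and the arrow function needs PHP 7.4 or later):

```php
<?php
// Hypothetical helper: returns libxml's error messages, or [] if well-formed.
function xmlFragmentErrors(string $fragment): array
{
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();
    // Wrap the fragment so that mixed text and elements have a single root.
    simplexml_load_string('<root>' . $fragment . '</root>');
    $messages = array_map(
        static fn(LibXMLError $e): string => trim($e->message),
        libxml_get_errors()
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $messages;
}

print_r(xmlFragmentErrors('a&nbsp;b'));
```

The returned messages could be shown next to the form field so the user knows what to fix.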

5. You could parse the pasted content as HTML into a DOM tree by setting someDiv.innerHTML = thePastedContent; In other words, create a div somewhere (probably display=none, except for debugging). Say you then have a javascript variable myDiv that holds this div element, and another variable myField that holds the element that is your input text field. Then in javascript you do

myDiv.innerHTML = myField.value;

which takes the unparsed text from myField, parses it into an HTML DOM tree, and sticks it into myDiv as HTML content.

Then you would use some browser-based method for serializing (= "de-parsing") the DOM tree back into XML. See for example this question. Then you send the result to the server as XML.

Whether you want to do this fix in the browser or on the server (as @Hannes suggested) will depend on the size of the data, how quick the response has to be, how beefy your server is, and whether you care about hackers sending not-well-formed XML on purpose.

LarsH
@Tomalak - why would ö become ö? When the text is put into innerHTML, won't it get parsed into the DOM as a single character o-umlaut?
LarsH
1. Would probably be too much overhead, right? 2. On second thought, this seems counterproductive, so I'm going to eliminate that option. 3. Besides the file being bigger, are there other downsides? If not, I'd say that's the way to go. 4. Yes, that would violate the requirements. 5. I don't understand this solution - can you provide more details? -Ryan
Ryan S.
@Ryan, I'll edit my answer to add details on 3 and 5.
LarsH
@LarsH Thank you for doing that! 3. The question is: would it cost more processor time string replacing these values or embedding a DTD that checks for entities? 5. OK, I understand now. I would like to do this on the server. -Ryan
Ryan S.
@Ryan - replacing the values yourself is probably faster, since DTD processing is much more general. But you'd have to test it to know for sure.
LarsH