I'm building an XML file from scratch and need to know whether htmlentities() converts every character that could break an XML file (including UTF-8 data). The values will come from a Twitter/Flickr feed, so I need to be sure!

+4  A: 

htmlentities() is not a guaranteed way to build legal XML.

Use htmlspecialchars() instead of htmlentities() if this is all you are worried about. If the encoding of your data does not match the encoding of your XML document, htmlentities() may work around or cover up the mismatch (bloating your XML in the process). I believe it's better to get your encodings consistent and just use htmlspecialchars().

Also, be aware that if you put the return value of htmlspecialchars() inside XML attributes delimited with single quotes, you will need to pass the ENT_QUOTES flag so that any single quotes in your source string are encoded too. I suggest doing this anyway, as it makes your code immune to bugs resulting from someone using single quotes for XML attributes in the future.
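As a minimal sketch of the advice above: the helper name `xml_escape()` is hypothetical, and adding the `ENT_XML1` flag (available since PHP 5.4) is an assumption that goes beyond the original answer. It asks htmlspecialchars() to emit `&apos;` for single quotes rather than a numeric reference.

```php
<?php
// Hypothetical helper; the name is illustrative and not part of PHP.
function xml_escape(string $value): string
{
    // ENT_QUOTES escapes both ' and "; ENT_XML1 (an assumption, not in the
    // original answer) selects XML 1.0 output, so ' becomes &apos;.
    return htmlspecialchars($value, ENT_QUOTES | ENT_XML1, 'UTF-8');
}

$title = 'Tom & Jerry\'s "best" <clips>';

// Safe even though the attribute is delimited with single quotes.
echo "<photo title='" . xml_escape($title) . "'/>", PHP_EOL;
// prints: <photo title='Tom &amp; Jerry&apos;s &quot;best&quot; &lt;clips&gt;'/>
```

Note that htmlspecialchars() returns an empty string when the input is not valid in the given encoding, unless ENT_SUBSTITUTE or ENT_IGNORE is also passed, so this does not replace checking the encoding of the input.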

Edit: To clarify:

htmlentities() will convert a number of non-ASCII characters (I assume this is what you mean by UTF-8 data) to entities (which are written using only ASCII characters). However, it cannot do so for characters that have no corresponding entity, so it cannot guarantee that its return value consists only of ASCII characters. That's why I'm suggesting not to use it.

If encoding is a possible issue, handle it explicitly (e.g. with iconv()).
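A minimal sketch of handling the encoding explicitly, assuming the incoming bytes are known to be ISO-8859-1 (that source encoding, and the helper names `is_valid_utf8()` and `to_utf8()`, are illustrative assumptions, not something the question states):

```php
<?php
// Hypothetical helper: preg_match() with the /u modifier fails on
// subjects that are not valid UTF-8, which makes it a cheap validity check.
function is_valid_utf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

// Hypothetical helper: convert from a known source encoding to UTF-8.
function to_utf8(string $bytes, string $fromEncoding): string
{
    $converted = iconv($fromEncoding, 'UTF-8', $bytes);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert from $fromEncoding");
    }
    return $converted;
}

$raw = "caf\xE9"; // "café" encoded as ISO-8859-1 (assumed input)

// Convert only when the bytes are not already valid UTF-8.
$text = is_valid_utf8($raw) ? $raw : to_utf8($raw, 'ISO-8859-1');
```

The conversion is only as good as the assumption about the source encoding; if the feed declares its encoding, use that declaration rather than guessing.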

Edit 2: Improved answer taking into account Josh Davis's comment below.

Jon
Do not use `htmlentities` for XML; it's intended for HTML, not XML. XML only knows the five entities *amp*, *lt*, *gt*, *apos* and *quot*, but `htmlentities` will use many more (those that are registered for HTML).
Gumbo
Thanks for the thorough explanation and note on using ENT_QUOTES!
John Himmelman
The statement "it will make your XML guaranteed legal" **couldn't be more wrong** though. As mentioned above, htmlentities() uses entities that are not defined in XML. In addition, it does not sanitize bytes that are not supposed to appear in an XML document, such as the NUL byte. It doesn't sanitize invalid UTF-8 either, so in some cases it might become impossible for XML parsers to parse the resulting document.
Josh Davis
@Josh: +1 well said, I was unaware that the pool of predefined entities in XML is smaller. On the other hand, I think that expecting your incoming data from twitter/flickr to be correctly encoded (in whatever encoding) and not contain null bytes are both reasonable assumptions. You can certainly explicitly test them for safety, but it's not directly related to the original question.
Jon
+3  A: 

DOMDocument::createTextNode() will automatically escape your content.

Example:

$dom = new DOMDocument;
$element = $dom->createElement('Element');
// The string is stored as literal text, so & and < are escaped on output
$element->appendChild(
    $dom->createTextNode('I am text with Ünicödé & HTML €ntities ©'));

$dom->appendChild($element);
echo $dom->saveXML();

Output:

<?xml version="1.0"?>
<Element>I am text with &#xDC;nic&#xF6;d&#xE9; &amp; HTML &#x20AC;ntities &#xA9;</Element>

Note that the above is not the same as passing the text as the second argument, $value, of DOMDocument::createElement(). That method only makes sure your element names are valid; it does not escape the value. See the Notes on the manual page.

Gordon
+2  A: 

So your question is "is htmlentities()'s result guaranteed to be XML-compliant and UTF-8-compliant?" The answer is no, it's not.

htmlspecialchars() should be enough to escape XML's special characters, but you'll have to sanitize your UTF-8 strings either way. Even if you build your XML with, say, SimpleXML, you'll have to sanitize the strings. I don't know about other libraries such as XMLWriter or DOM, but I think it's the same.
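One way such sanitizing could look, as a sketch rather than a definitive implementation: the helper name `xml_sanitize()` is hypothetical, and the choice to strip forbidden characters but reject invalid UTF-8 outright is an assumption. The character ranges are those allowed by the XML 1.0 `Char` production.

```php
<?php
// Hypothetical helper: removes characters that XML 1.0 does not allow
// (for example NUL and most other control characters).
function xml_sanitize(string $value): string
{
    // Allowed by XML 1.0: #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD,
    // #x10000-#x10FFFF. Everything else is stripped.
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';

    // With the /u modifier, preg_replace() returns null when the subject
    // is not valid UTF-8, so that case can be reported instead of hidden.
    $clean = preg_replace($pattern, '', $value);
    if ($clean === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $clean;
}

// Sanitize first, then escape the markup characters.
$safe = htmlspecialchars(xml_sanitize("Fish \0& Chips"), ENT_QUOTES, 'UTF-8');
echo "<item>$safe</item>", PHP_EOL;
```

Whether to strip, replace, or reject bad input depends on the application; stripping silently is only one option.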

Josh Davis