I am reading the documentation for creating a podcast feed suitable for iTunes, and the Common Mistakes section says:


Using HTML Named Character Entities.

<!-- illegal xml -->
<copyright>&copy; 2005 John Doe</copyright>

<!-- valid xml -->
<copyright>&#xA9; 2005 John Doe</copyright>

Unlike HTML, XML supports only five "named character entities":

character   name               xml
&           ampersand          &amp;
<           less-than sign     &lt;
>           greater-than sign  &gt;
'           apostrophe         &apos;
"           quotation          &quot;

The five characters above are the only characters that require escaping in XML. All other characters can be entered directly in an editor that supports UTF-8. You can also use numeric character references that specify the Unicode for the character, for example:

character   name                       xml
©           copyright sign             &#xA9;
℗           sound recording copyright  &#x2117;
™           trade mark sign            &#x2122;

For further reference see XML Character and Entity References.


Right now I'm using htmlentities() under PHP5 and the feed is validating and working. But from what I gather, some things that could get put into content might become entities that would make the feed invalid. What's the best function to use to ensure I'm not passing along bad data? I'm paranoid that something will get entered, get entity-ized, and break the feed -- should I just use str_replace() and replace with named entities and leave the rest alone? Or can I use htmlspecialchars() somehow?

So in short, what's a drop-in replacement for htmlentities() that will make sure input is safe for descriptions, titles, etc. in a podcast RSS feed?

+2  A: 

You can either:

  • Use a CDATA block instead (just make sure you're using the correct encoding, i.e., the encoding of the XML file matches the encoding of the data). The only thing you have to look out for is ]]>, which cannot be put literally in a CDATA block.
  • Use mb_encode_numericentity instead of htmlentities (possibly combined with htmlspecialchars and a prior decoding of HTML entities with mb_convert_encoding).
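
For the second option, here is a minimal sketch of how the conversion map might look. It assumes the input is already UTF-8 with no HTML entities in it, that the mbstring extension is loaded, and that PHP 5.4 or later is available (for ENT_XML1). The helper name xml_numeric_escape is made up for illustration.

```php
<?php
// Hypothetical helper name, for illustration only.
function xml_numeric_escape($text)
{
    // Escape &, <, >, " and ' first, so the ampersands added by the
    // numeric references below are not escaped a second time.
    $escaped = htmlspecialchars($text, ENT_QUOTES | ENT_XML1, 'UTF-8');

    // Conversion map: [first code point, last code point, offset, mask].
    // This one covers every code point above ASCII and leaves the value unchanged.
    $convmap = array(0x80, 0x10FFFF, 0, 0x1FFFFF);

    // The third argument is the encoding of the input string.
    return mb_encode_numericentity($escaped, $convmap, 'UTF-8');
}

// "\xC2\xA9" is the UTF-8 byte sequence for the copyright sign.
echo xml_numeric_escape("\xC2\xA9 2005 John & Jane Doe"), "\n";
```

The result is pure ASCII, so it should stay valid whatever encoding the feed declares.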

If the encoding of the XML file is UTF-8, you can just remove the entities. Suppose you have the following HTML fragment:

&copy; 2005 John Doe

Then, you could just do:

$data = "&copy; 2005 John Doe";
$data = mb_convert_encoding($data, "UTF-8", "HTML-ENTITIES");
$data = htmlspecialchars($data, ENT_NOQUOTES, "UTF-8");
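
As far as I know, the "HTML-ENTITIES" pseudo-encoding is deprecated in recent PHP releases (8.2 and later), so a variant of the same two steps could use html_entity_decode instead. This is only a sketch: it assumes PHP 5.4 or later and a UTF-8 feed, and the helper name xml_safe_text is hypothetical.

```php
<?php
// Hypothetical helper name; not part of PHP or of the iTunes feed spec.
function xml_safe_text($html)
{
    // Turn HTML entities (&copy;, &#169;, &amp;, ...) into real UTF-8 characters.
    $text = html_entity_decode($html, ENT_QUOTES | ENT_HTML401, 'UTF-8');

    // Re-escape only the characters that XML treats as markup.
    return htmlspecialchars($text, ENT_QUOTES | ENT_XML1, 'UTF-8');
}

echo xml_safe_text('&copy; 2005 John Doe &amp; Sons'), "\n";
```

Decoding first means entities already present in the input are not double-escaped, and the only named entities left in the output are the five that XML predefines.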
Artefacto
Their specs specifically say "CDATA sections are strongly discouraged." So that's out. If I use `mb_encode_numericentity` http://us3.php.net/mb_encode_numericentity , what am I passing in as the 2nd and 3rd parameters: `array $convmap , string $encoding` ? I'm guessing `$encoding` would be 'UTF-8'
artlung
@art I've edited the answer to address your concerns :p
Artefacto
So it looks like I would go ahead and keep running `htmlentities()` before I run your `mb_convert_encoding()` and `htmlspecialchars()` then? Those two calls basically "xml-ize" the encodings to match, true?
artlung
@art You may or may not need it; it depends on the original encoding of the data. To be safe, yes: call `htmlentities`, passing it the correct initial encoding of the HTML data and the "no double encoding" argument, and then call `mb_convert_encoding` + `htmlspecialchars`. Note that if your initial data contains no HTML entities whatsoever, then your escaping task is as simple as a call to `htmlspecialchars`.
Artefacto
Thanks @Artefacto. I'm adding it, and will be keeping an eye on how well the feed validates.
artlung