Your XML parser is not lying. That's an invalid (not even well-formed) document that you won't be able to load with anything.
rsquo is a predefined entity in HTML, but not in XML. In an XML document, if you want to use anything but the most basic built-in entities (amp, lt, gt, quot and apos), you must define them in a DTD referenced by a <!DOCTYPE> declaration. (This is how XHTML does it.)
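As an illustration of the DTD route, here is a minimal sketch that declares rsquo in an internal DTD subset. The note element and the sample text are made up for the example; the only point is that the same entity reference fails to parse without the declaration and loads once it is declared.

```php
<?php
// Collect libxml errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// No DTD: &rsquo; is undeclared, so this is not well-formed XML.
$withoutDtd = '<?xml version="1.0" encoding="UTF-8"?><note>It&rsquo;s broken</note>';

// Internal DTD subset declaring rsquo as U+2019 (hypothetical sample document).
$withDtd = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE note [
  <!ENTITY rsquo "&#8217;">
]>
<note>It&rsquo;s fine</note>
XML;

$first = new DOMDocument();
$loadedWithoutDtd = $first->loadXML($withoutDtd);

$second = new DOMDocument();
$loadedWithDtd = $second->loadXML($withDtd);

libxml_clear_errors();
```

This only helps if you control the document being produced, which is the same condition as fixing the generator in the first place.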
You need to find out where the input came from and fix whatever was responsible for creating it, because at the moment it's simply not XML. It should use a character reference (&#8217;) or just the plain literal character ’ in UTF-8 encoding.
As a last resort, if you really have to accept this malformed nonsense as input, you could do nasty string replacements over the file:
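$xml = file_get_contents($_FILES['file']['tmp_name']);
// Swap the HTML-only entity for the equivalent numeric character reference.
$xml = str_replace('&rsquo;', '&#8217;', $xml);
$dom = new DOMDocument();
$dom->loadXML($xml);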
If you need to do this for all the non-XML HTML entities and not just rsquo, that's a bit more tricky. You could do a regex replacement:
// Decode HTML-only named entities; leave XML's five predefined ones untouched.
function only_html_entity_decode($match) {
    if (in_array($match[1], array('amp', 'lt', 'gt', 'quot', 'apos'))) {
        return $match[0];
    }
    return html_entity_decode($match[0], ENT_COMPAT, 'UTF-8');
}

$xml = preg_replace_callback('/&(\w+);/', 'only_html_entity_decode', $xml);
This still isn't great, as it will maul any &\w+; sequences inside places like comments, CDATA sections and PIs, where they don't actually denote an entity reference. But it's probably about the best you can do given this broken input.
It's certainly better than calling html_entity_decode over the whole document, which will also mess up any XML entity references, causing the document to break whenever there's an existing &amp; or &lt;.
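If the comment/CDATA/PI problem matters for your input, one partial mitigation is to let the regex match those constructs first and pass them through unchanged. The sketch below assumes a hypothetical helper name, decode_html_entities_outside_markup, and it is still a text-level hack rather than a parser: it does not, for instance, try to understand a DOCTYPE internal subset.

```php
<?php
// Hypothetical helper: like only_html_entity_decode, but comments, CDATA
// sections and processing instructions are matched first and left alone.
function decode_html_entities_outside_markup($xml) {
    $pattern = '/<!--.*?-->|<!\[CDATA\[.*?\]\]>|<\?.*?\?>|&(\w+);/s';

    return preg_replace_callback($pattern, function ($m) {
        // Group 1 is only set when the entity alternative matched.
        if (!isset($m[1])) {
            return $m[0];
        }
        // XML's predefined entities must stay as references.
        if (in_array($m[1], array('amp', 'lt', 'gt', 'quot', 'apos'), true)) {
            return $m[0];
        }
        // Unknown names come back unchanged from html_entity_decode.
        return html_entity_decode($m[0], ENT_COMPAT, 'UTF-8');
    }, $xml);
}

$input  = '<r><!-- &rsquo; --><![CDATA[&rsquo;]]><a>it&rsquo;s &amp; ok</a></r>';
$output = decode_html_entities_outside_markup($input);
```

In this sample only the reference inside the a element is decoded; the ones in the comment and the CDATA section, and the &amp; reference, are left as they were.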
Another hack, questionable in different ways, would be to load the document using loadHTML().
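A minimal sketch of that approach, using a made-up one-element input, might look like the following. It also shows one of the ways it is questionable: by default the HTML parser applies HTML rules and wraps the content in implied html and body elements, so the resulting tree is not the tree your XML described.

```php
<?php
// Collect parser complaints instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
// The HTML parser knows the HTML named entities, so &rsquo; is accepted.
$dom->loadHTML('<p>It&rsquo;s</p>');
libxml_clear_errors();

// DOM strings are UTF-8 in PHP; the entity arrives as U+2019.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;

// The parser added an implied <body> that was not in the input.
$bodyCount = $dom->getElementsByTagName('body')->length;
```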