Hi, I have a problem loading an XML file with PHP. I use DOMDocument because I need the getElementsByTagName function.
This is the code I use, followed by the file I am trying to load:

$dom = new DomDocument('1.0', 'UTF-8');
$dom->resolveExternals = false;
$dom->load($_FILES["file"]["tmp_name"]);

<?xml version="1.0" encoding="UTF-8"?>
<Data>
  <value>1796563</value>
  <value>Verliebt! &rsquo;</value>
</Data>

Error message:
Warning: DOMDocument::load() [domdocument.load]: Entity 'rsquo' not defined in /tmp/php1VRb3N, line: 4 in /www/htdocs/bla/upload.php on line 51

A: 

In order to use that Entity, it must be defined in a DTD. Otherwise it's invalid XML. If you don't have a DTD, you should decode the entity prior to loading the XML with DOM:

// loadXML() parses an XML string; load() expects a filename instead.
$dom->loadXML(
    html_entity_decode(
        file_get_contents($_FILES["file"]["tmp_name"]),
        ENT_COMPAT, 'UTF-8'));
Gordon
A: 

Your XML parser is not lying. That's an invalid (not even well-formed) document that you won't be able to load with anything.

rsquo is a predefined entity in HTML, but not in XML. In an XML document if you want to use anything but the most basic built-in entities (amp, lt, gt, quot and apos) you must define them in a DTD referenced by a <!DOCTYPE> declaration. (This is how XHTML does it.)
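
To illustrate the DTD route, here is a minimal sketch. It assumes whoever produces the file can add an internal DTD subset; the document below is a made-up variant of the one in the question, and the variable names are only for illustration. With rsquo declared, the same markup should parse without the "Entity not defined" warning:

```php
<?php
// Hypothetical variant of the question's file: the internal DTD subset
// declares rsquo, so the reference inside <value> is now defined.
$xml = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Data [
  <!ENTITY rsquo "&#8217;">
]>
<Data>
  <value>1796563</value>
  <value>Verliebt! &rsquo;</value>
</Data>
XML;

libxml_use_internal_errors(true);   // collect parser errors instead of emitting warnings
$dom = new DOMDocument();
$loaded = $dom->loadXML($xml);
$errors = libxml_get_errors();
libxml_clear_errors();

echo $loaded ? "loaded\n" : "failed\n";
echo count($errors), " parser error(s)\n";
```

This only helps if you control the producer of the file; for uploads you cannot change, the workarounds further down apply.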

You need to find out where the input came from and fix whatever was responsible for creating it, because at the moment it's simply not XML. Use a character reference (&#8217;) or just the plain literal character in UTF-8 encoding.
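
As a quick check of the character reference suggestion, this sketch takes the document from the question with &rsquo; rewritten as &#8217; (nothing else is changed; the variable names are mine) and loads it with DOMDocument:

```php
<?php
// The question's document with &rsquo; written as the numeric reference &#8217;.
$xml = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<Data>
  <value>1796563</value>
  <value>Verliebt! &#8217;</value>
</Data>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);

// The parser resolves the character reference, so the text node
// holds the actual U+2019 character encoded as UTF-8.
$values = $dom->getElementsByTagName('value');
$text = $values->item(1)->textContent;
echo $text, "\n";
```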

As a last resort if you really have to accept this malformed nonsense for input you could do nasty string replacements over the file:

$xml= file_get_contents($_FILES['file']['tmp_name']);
$xml= str_replace('&rsquo;', '&#8217;', $xml);
$dom->loadXML($xml);

If you need to do this for all the non-XML HTML entities and not just rsquo that's a bit more tricky. You could do a regex replacement:

function only_html_entity_decode($match) {
    if (in_array($match[1], array('amp', 'lt', 'gt', 'quot', 'apos')))
        return $match[0];
    else
        return html_entity_decode($match[0], ENT_COMPAT, 'UTF-8');
}
$xml= preg_replace_callback('/&(\w+);/', 'only_html_entity_decode', $xml);

This still isn't great as it's going to be mauling any sequences of &\w+; characters inside places like comments, CDATA sections and PIs, where this doesn't actually mean an entity reference. But it's probably about the best you can do given this broken input.

It's certainly better than calling html_entity_decode over the whole document, which will also mess up any XML entity references, causing the document to break whenever there's an existing &amp; or &lt;.
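
A small sketch of that difference, using a made-up one-line input that mixes a real XML entity (&amp;) with an HTML-only one (&rsquo;). The variable names are only for illustration, and the closure is just the callback idea from above written inline:

```php
<?php
// Made-up input mixing a legitimate XML entity (&amp;) with an HTML-only one (&rsquo;).
$xml = '<Data><value>Tom &amp; Jerry&rsquo;s</value></Data>';

libxml_use_internal_errors(true);   // keep parser warnings out of the output

// Whole-document decode: &amp; becomes a bare "&", which is not well-formed.
$all = html_entity_decode($xml, ENT_COMPAT, 'UTF-8');
$domAll = new DOMDocument();
$okAll = $domAll->loadXML($all);

// Selective decode: leave the five XML entities alone, decode the rest.
$some = preg_replace_callback('/&(\w+);/', function ($m) {
    $xmlEntities = array('amp', 'lt', 'gt', 'quot', 'apos');
    return in_array($m[1], $xmlEntities, true)
        ? $m[0]
        : html_entity_decode($m[0], ENT_COMPAT, 'UTF-8');
}, $xml);
$domSome = new DOMDocument();
$okSome = $domSome->loadXML($some);
libxml_clear_errors();

var_dump($okAll, $okSome);
```

The first load should fail on the bare ampersand, while the second keeps &amp; intact and only turns &rsquo; into the literal character.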

Another hack, questionable in different ways, would be to load the document using loadHTML().

bobince
Thanks for your help.
Bendim
A: 

My solution, with help from bobince, simply strips every named entity reference before loading:

    $xml= file_get_contents($_FILES['file']['tmp_name']);
    $xml= preg_replace('/&(\w+);/', '', $xml);
    $dom = new DomDocument();
    $dom->loadXML($xml);
Bendim