ansaurus

Question

load DOMDocument with HTML Special Characters (php)

Answer 1

+1 A:

To use that entity, you must define it in a DTD. Otherwise it's invalid XML. If you don't have a DTD, you should decode the entity before loading the XML with DOM:

// loadXML() parses a string; load() expects a filename instead.
$dom->loadXML(
    html_entity_decode(
        file_get_contents($_FILES["file"]["tmp_name"]),
        ENT_COMPAT, 'UTF-8'));

Gordon 2010-10-08 22:43:46

Answer 2

A:

Your XML parser is not lying. That's an invalid (not even well-formed) document that you won't be able to load with anything.

rsquo is a predefined entity in HTML, but not in XML. In an XML document if you want to use anything but the most basic built-in entities (amp, lt, gt, quot and apos) you must define them in a DTD referenced by a <!DOCTYPE> declaration. (This is how XHTML does it.)
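As a rough illustration of the DTD route, here is a minimal sketch that declares rsquo in an internal DTD subset. The sample document and its note element are made up for the example, and LIBXML_NOENT is only used here so the reference is expanded on a trusted literal string; it is not something to switch on for untrusted uploads.

```php
<?php
// Sketch: declare rsquo in an internal DTD subset so the XML parser accepts it.
// The sample document and the <note> element are made up for illustration.
$xml = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE note [
  <!ENTITY rsquo "&#8217;">
]>
<note>It&rsquo;s fine</note>
XML;

$dom = new DOMDocument();
// LIBXML_NOENT substitutes entity references with their declared values.
// Avoid it on untrusted input, where it can also expand external entities.
$loaded = $dom->loadXML($xml, LIBXML_NOENT);
$text = $loaded ? $dom->documentElement->textContent : null;
```

With the declaration in place the parser treats &rsquo; like any other entity, which is the same mechanism the XHTML DTDs rely on.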

You need to find out where the input came from and fix whatever was responsible for creating it, because at the moment it's simply not XML. Use a character reference (&#8217;) or just the plain literal character ’ in UTF-8 encoding.
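To make the difference concrete, a small sketch with two made-up one-line documents: one uses the HTML entity, the other the numeric character reference. The variable names are only for illustration.

```php
<?php
// Collect parse errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$broken = '<note>It&rsquo;s fine</note>';   // HTML entity, undefined in XML
$fixed  = '<note>It&#8217;s fine</note>';   // numeric character reference

$dom = new DOMDocument();
$brokenLoads = $dom->loadXML($broken);      // undefined entity, so loading fails
libxml_clear_errors();

$dom = new DOMDocument();
$fixedLoads = $dom->loadXML($fixed);
$text = $fixedLoads ? $dom->documentElement->textContent : null;
```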

As a last resort if you really have to accept this malformed nonsense for input you could do nasty string replacements over the file:

$xml= file_get_contents($_FILES['file']['tmp_name']);
$xml= str_replace('&rsquo;', '&#8217;', $xml);
$dom->loadXML($xml);

If you need to do this for all the non-XML HTML entities and not just rsquo that's a bit more tricky. You could do a regex replacement:

// Decode HTML-only entities; leave the five predefined XML entities untouched.
function only_html_entity_decode($match) {
    if (in_array($match[1], array('amp', 'lt', 'gt', 'quot', 'apos'), true)) {
        return $match[0];
    }
    return html_entity_decode($match[0], ENT_COMPAT, 'UTF-8');
}
$xml = preg_replace_callback('/&(\w+);/', 'only_html_entity_decode', $xml);

This still isn't great as it's going to be mauling any sequences of &\w+; characters inside places like comments, CDATA sections and PIs, where this doesn't actually mean an entity reference. But it's probably about the best you can do given this broken input.

It's certainly better than calling html_entity_decode over the whole document, which will also mess up any XML entity references, causing the document to break whenever there's an existing & or <.
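A quick side-by-side sketch of the two approaches, using a made-up input that mixes legitimate XML escapes with an HTML-only entity. The closure is just an inline version of the selective callback idea, not a different technique.

```php
<?php
libxml_use_internal_errors(true);

// Made-up input: valid XML escapes (&lt;, &amp;) next to an HTML-only entity.
$xml = '<note>1 &lt; 2 &amp; it&rsquo;s true</note>';

// Whole-document decode: also turns &lt; and &amp; into raw < and &.
$naive = html_entity_decode($xml, ENT_COMPAT, 'UTF-8');

// Selective decode: skip the five entities XML predefines.
$selective = preg_replace_callback('/&(\w+);/', function ($m) {
    $predefined = array('amp', 'lt', 'gt', 'quot', 'apos');
    return in_array($m[1], $predefined, true)
        ? $m[0]
        : html_entity_decode($m[0], ENT_COMPAT, 'UTF-8');
}, $xml);

$dom = new DOMDocument();
$naiveLoads = $dom->loadXML($naive);          // raw < and & break the parse
libxml_clear_errors();

$dom = new DOMDocument();
$selectiveLoads = $dom->loadXML($selective);  // escapes survive, so it parses
```

The naive result contains a bare < and &, so it is no longer well-formed; the selective result keeps &lt; and &amp; intact and only replaces &rsquo;.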

Another hack, questionable in different ways, would be to load the document using loadHTML().
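For what it's worth, a minimal sketch of the loadHTML() route on a made-up fragment. It shows one of the side effects: the parser treats the input as HTML and wraps it in html and body elements, so the resulting tree is not the tree of your original XML.

```php
<?php
libxml_use_internal_errors(true);  // the HTML parser warns about unknown tags

$dom = new DOMDocument();
$loaded = $dom->loadHTML('<p>It&rsquo;s fine</p>');
libxml_clear_errors();

// The fragment ends up inside <html><body>...</body></html>.
$root = $dom->documentElement->nodeName;
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
```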

bobince 2010-10-08 23:01:56

Thanks for your help.

Bendim 2010-10-09 01:19:28

Answer 3

A:

My solution with help from bobince is:

    $xml = file_get_contents($_FILES['file']['tmp_name']);
    // Note: this removes every named entity reference, including &amp; and &lt;.
    $xml = preg_replace('/&(\w+);/', '', $xml);
    $dom = new DOMDocument();
    $dom->loadXML($xml);
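If stripping is acceptable but the five predefined XML entities should survive, one possible variant is a negative lookahead in the pattern. This is a sketch on a made-up input, not the code from the answer above.

```php
<?php
// Made-up input with both XML-predefined and HTML-only entities.
$xml = '<note>Tom &amp; Jerry&rsquo;s &lt;show&gt;</note>';

// Strip named entities except amp, lt, gt, quot and apos.
$stripped = preg_replace('/&(?!(?:amp|lt|gt|quot|apos);)\w+;/', '', $xml);

$dom = new DOMDocument();
$loaded = $dom->loadXML($stripped);
$text = $loaded ? $dom->documentElement->textContent : null;
```

Numeric references such as &#8217; are left alone as well, since \w does not match the # character.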

Bendim 2010-10-09 01:23:43

bobince 2010-10-09 01:46:45

Bendim 2010-10-09 01:58:13
