Hi guys,

I'm having some difficulty with the PHP DOM classes.

I am making a sitemap script, and I need the output of $doc->saveXML() to be like

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <url>
        <loc>http://www.somesite.com/servi&#xE7;os/redesign</loc>
    </url>
</root>

or

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <url>
        <loc>http://www.somesite.com/servi&#231;os/redesign</loc>
    </url>
</root>

but I am getting:

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <url>
        <loc>http://www.somesite.com/servi&amp;#xE7;os/redesign</loc>
    </url>
</root>

This is the closest I could get, using a function that replaces named entities with numeric ones.

I was also able to produce:

<?xml version="1.0" ?>
<root>
    <url>
        <loc>http://www.somesite.com/servi&amp;#xE7;os/redesign</loc>
    </url>
</root>

But without the encoding specified.

The best solution (the way I think the code should be written) would be:

<?php
$myArray = array();
// do some stuff to populate the array with URL strings

$doc = new DOMDocument('1.0', 'UTF-8');

// here we modify some property. Maybe this is the answer I am looking for...

$urlset = $doc->createElement("urlset");
$urlset = $doc->appendChild($urlset);

foreach ($myArray as $value) {
    $url = $doc->createElement("url");
    $url = $urlset->appendChild($url);

    $loc = $doc->createElement("loc");
    $loc = $url->appendChild($loc);

    // wrap the URL string in a text node, then attach that node to <loc>
    $valueContent = $doc->createTextNode($value);
    $valueContent = $loc->appendChild($valueContent);
}

echo $doc->saveXML();
?>

Notes:

  • Server response header contains charset as UTF-8;
  • PHP script is saved in UTF-8;
  • URLs read are UTF-8 strings;
  • The script above declares the encoding in the DOMDocument constructor, and does not use any conversion functions, like htmlentities, urlencode, utf8_encode...

I've tried changing the values of the DOMDocument::$resolveExternals and DOMDocument::$substituteEntities properties. No combination worked.

And yes, I know I can do the whole process without specifying the character set in the DOMDocument constructor, dump the string content into a variable, and make a very simple substitution with string replace functions. This works. But I would like to know where I am slipping up, how this can be done using the native APIs and settings, or even whether it is possible.

Thanks in advance.

A: 

Decode your entities before passing the string to createTextNode:

$valueContent = $doc->createTextNode(html_entity_decode($value, ENT_QUOTES, 'UTF-8'));

That's because createTextNode() takes its argument as literal text, not markup, so &#231; is not parsed as a character reference there. DOMDocument sees the & and encodes it as &amp;
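
As a minimal sketch of that decoding step (the input URL is a hypothetical example, and the variable names are not from the question), the following compares serializing the string as-is with serializing it after html_entity_decode:

```php
<?php
// Hypothetical input that still contains a numeric character reference.
$raw = 'http://www.somesite.com/servi&#xE7;os/redesign';

// 1) Pass the string through unchanged: the "&" is treated as literal text
//    and is escaped when the node is serialized.
$doc = new DOMDocument('1.0', 'UTF-8');
$loc = $doc->appendChild($doc->createElement('loc'));
$loc->appendChild($doc->createTextNode($raw));
$escaped = $doc->saveXML($loc);

// 2) Decode first, so the text node holds the real character (UTF-8 bytes).
$decoded = html_entity_decode($raw, ENT_QUOTES, 'UTF-8');
$doc2 = new DOMDocument('1.0', 'UTF-8');
$loc2 = $doc2->appendChild($doc2->createElement('loc'));
$loc2->appendChild($doc2->createTextNode($decoded));
$clean = $doc2->saveXML($loc2);
```

Here $escaped contains servi&amp;#xE7;os, while $clean contains no &amp; at all, because the text node no longer holds an ampersand.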

ircmaxell
A: 

resolveExternals and substituteEntities are parser features. They don't have an effect on serialisation.

The XML infoset doesn't make any distinction whatsoever between:

<loc>http://www.somesite.com/serviços/redesign</loc>
<loc>http://www.somesite.com/servi&#xE7;os/redesign</loc>
<loc>http://www.somesite.com/servi&#231;os/redesign</loc>

they all represent exactly the same information, any XML parser must treat them as identical, and XML serializers don't generally let you choose which to output. Normally you should just set the text node's value to include ç and let the serialiser write it to ç, as a raw UTF-8 byte string in the output.
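
One way to check this equivalence, as a rough sketch: parse each of the three forms with DOMDocument::loadXML and compare the resulting text content. The $variants and $texts names are illustrative only.

```php
<?php
// Three serializations of the same character data.
$variants = [
    "<loc>http://www.somesite.com/servi\xC3\xA7os/redesign</loc>", // raw UTF-8 bytes for "ç"
    '<loc>http://www.somesite.com/servi&#xE7;os/redesign</loc>',   // hexadecimal reference
    '<loc>http://www.somesite.com/servi&#231;os/redesign</loc>',   // decimal reference
];

$texts = [];
foreach ($variants as $xml) {
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    // textContent is the parsed value, independent of how it was written.
    $texts[] = $doc->documentElement->textContent;
}

// True when every variant parsed to the same string.
$allEqual = count(array_unique($texts)) === 1;
```

After parsing, nothing in the DOM records which spelling the source used, which is why the serializer is free to pick its own.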

If you really must generate an XML file that contains only ASCII, so you can't use characters like ç directly, then tell PHP to use ASCII as the document encoding:

$s = "serviços"; // or "servi\xC3\xA7os" if you can't input UTF-8 strings directly

$doc = new DOMDocument('1.0', 'US-ASCII');
$doc->appendChild($loc= $doc->createElement('loc'));
$loc->appendChild($doc->createTextNode($s));
echo $doc->saveXML();

result:

<?xml version="1.0" encoding="US-ASCII"?>
<loc>servi&#231;os</loc>

However... having said all that, I still don't think this is right. Your value seems to be a URL, and non-ASCII characters aren't valid in URLs regardless of how they're encoded in the containing XML. It should be:

http://www.somesite.com/servi%C3%A7os/redesign

i.e. rawurlencode('serviços').
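
A hedged sketch of that approach: the encodePath helper below is hypothetical (not part of PHP or the question) and assumes the path has not been percent-encoded already, since rawurlencode would otherwise encode the % signs a second time.

```php
<?php
// Hypothetical helper: percent-encode each path segment, keeping "/" separators.
function encodePath(string $path): string
{
    return implode('/', array_map('rawurlencode', explode('/', $path)));
}

$host = 'http://www.somesite.com';   // assumed to be plain ASCII already
$path = "/servi\xC3\xA7os/redesign"; // UTF-8 bytes for "/serviços/redesign"
$url  = $host . encodePath($path);

// The encoded URL is pure ASCII, so the document encoding no longer matters.
$doc = new DOMDocument('1.0', 'UTF-8');
$loc = $doc->appendChild($doc->createElement('loc'));
$loc->appendChild($doc->createTextNode($url));
$xml = $doc->saveXML($loc);
```

Encoding segment by segment, rather than calling rawurlencode on the whole URL, keeps the scheme, host and slashes readable.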

bobince
Thanks for your input and clarifications. It wasn't exactly the solution I was looking for, but it helped a lot in leading me to the right path.
Dave