This sounds like a pretty easy question to answer but I haven't been able to get it to work. I'm running PHP 5.2.6.

I have a DOM element (the root element). When I serialize it with saveXML(), the output contains an xmlns attribute:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
...

However, I cannot find any way programmatically within PHP to see that namespace. I want to be able to check whether it exists and what it's set to.

Checking $document->documentElement->namespaceURI would be the obvious answer but that is empty (I've never actually been able to get that to be non-empty). What is generating that xmlns value in the output and how can I read it?

The only practical way I've been able to do this so far is a complete hack - by saving it as XML to a string using saveXML() then reading through that using regular expressions.

Edit:

This may be a peculiarity of loading XML in using loadHTML() rather than loadXML() and then printing it out using saveXML(). When you do that, it appears that for some reason saveXML adds an xmlns attribute even though there is no way to detect that this xmlns value is part of the document using DOM methods. Which I guess means that if I had a way of detecting whether the document passed in had been loaded in using loadHTML() then I could solve this a different way.
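
A minimal sketch of the situation described above, assuming an input with an XHTML 1.0 Strict DOCTYPE. The markup and variable names are illustrative, not taken from the real document. It loads the markup with loadHTML(), checks the DOM property that comes back empty, and then serializes with saveXML(). Whether xmlns actually shows up in the serialized string may depend on the PHP and libxml versions.

```php
<?php
// Illustrative input (assumption): an HTML document with an XHTML DOCTYPE.
$html = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
      . '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
      . '<html lang="en"><head><title>t</title></head><body><p>Hi</p></body></html>';

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);

// Empty for a document parsed by loadHTML(), as described above.
var_dump($dom->documentElement->namespaceURI);

// Serialize as XML and check whether an xmlns attribute appears in the string.
$serialized = $dom->saveXML();
var_dump(strpos($serialized, 'xmlns=') !== false);
```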

+2  A: 

With PHP 5.2.6 I've found two ways to do this:

<?php
$xml = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?'.
       '><html xmlns="http://www.w3.org/1999/xhtml" lang="en"></html>';
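// Call loadXML() on an instance; the static form is not supported in newer PHP versions.
$x = new DOMDocument();
$x->loadXML($xml);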
var_dump($x->documentElement->getAttribute("xmlns"));
var_dump($x->documentElement->lookupNamespaceURI(NULL));

prints

string(28) "http://www.w3.org/1999/xhtml"
string(28) "http://www.w3.org/1999/xhtml"

Hope that's what you asked for :)

edorian
Thanks for your answer - it doesn't solve my problem but tips me off that it seems to be something peculiar to documents loaded in from loadHTML() rather than loadXML() because indeed, your example works with loadXML(). Looks like loadHTML creates documents with an "invisible namespace" which can't be read using DOM methods but which appears when you saveXML().
thomasrutter
I'm not sure I follow you 100%, but loading something with loadHTML and re-saving it via saveXML doesn't add an xmlns for me. It just adds / preserves a doctype from the HTML. If you can provide a small reproduction script alongside the output you want, I can dig deeper.
edorian
Interesting - it sometimes does and sometimes doesn't. If your input HTML document has an XHTML DOCTYPE, it does. It will do it for this input: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
thomasrutter
I have NO idea how you would detect that in a DOM.
thomasrutter
+1  A: 

Well, you can do so with a function like this:

function getNamespaces(DomNode $node, $recurse = false) {
    $namespaces = array();
    if ($node->namespaceURI) {
        $namespaces[] = $node->namespaceURI;
    }
    if ($node instanceof DomElement && $node->hasAttribute('xmlns')) {
        $namespaces[] = $xmlns = $node->getAttribute('xmlns');
        foreach ($node->attributes as $attr) {
            if ($attr->namespaceURI == $xmlns) {
                $namespaces[] = $attr->value;
            }
        }
    }
    if ($recurse && $node instanceof DomElement) {
        foreach ($node->childNodes as $child) {
            $namespaces = array_merge($namespaces, getNamespaces($child, true));
        }
    }
    return array_unique($namespaces);
}

So, you feed it a DOMElement, and it finds all related namespaces:

$xml = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <html xmlns="http://www.w3.org/1999/xhtml" 
         lang="en" 
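         xmlns:foo="http://example.com/bar">
           <body>
                <h1>foo</h1>
                <foo:h2>bar</foo:h2>
           </body>
 </html>';
// Load the markup so that $dom exists before it is inspected.
$dom = new DOMDocument();
$dom->loadXML($xml);
var_dump(getNamespaces($dom->documentElement, true));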

Prints out:

array(2) {
  [0]=>
  string(28) "http://www.w3.org/1999/xhtml"
  [3]=>
  string(22) "http://example.com/bar"
}

Note that DomDocument will automatically strip out all unused namespaces...

As for why $dom->documentElement->namespaceURI is always null, it's because the document element doesn't have a namespace. The xmlns attribute provides a default namespace for the document, but it doesn't endow the html tag with a namespace (for purposes of DOM interaction). You can try doing a $dom->documentElement->removeAttribute('xmlns'), but I'm not 100% sure if it will work...

ircmaxell
+3  A: 

As edorian already showed, getting the namespace works fine when the markup is loaded with loadXML. But you are right that this won't work for markup loaded with loadHTML:

$html = <<< XML
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:m="foo" lang="en">
    <body xmlns="foo">Bar</body>
</html>
XML;

$dom = new DOMDocument;
$dom->loadHTML($html);

var_dump($dom->documentElement->getAttribute("xmlns"));
var_dump($dom->documentElement->lookupNamespaceURI(NULL));
var_dump($dom->documentElement->namespaceURI);

will produce empty results. But you can use XPath

$xp = new DOMXPath($dom);
echo $xp->evaluate('string(@xmlns)');
// http://www.w3.org/1999/xhtml

and for body

echo $xp->evaluate('string(body/@xmlns)'); // foo

or with context node

$body = $dom->documentElement->childNodes->item(0);
echo $xp->evaluate('string(@xmlns)', $body);
// foo
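
The XPath lookups above could be wrapped in a small helper. This is only a sketch: the function name readXmlnsAttribute is made up here, and it assumes the xmlns attribute is reachable through XPath in the way the snippets above show for documents parsed with loadHTML().

```php
<?php
// Hypothetical helper (the name is illustrative): returns the value of an
// element's xmlns attribute as seen by XPath, or null if there is none.
function readXmlnsAttribute(DOMElement $element)
{
    $xpath = new DOMXPath($element->ownerDocument);
    // string(@xmlns) evaluates to '' when the attribute is absent.
    $value = $xpath->evaluate('string(@xmlns)', $element);
    return $value === '' ? null : $value;
}

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML('<html xmlns="http://www.w3.org/1999/xhtml"><body><p>Bar</p></body></html>');

var_dump(readXmlnsAttribute($doc->documentElement));
$bodyElement = $doc->getElementsByTagName('body')->item(0);
var_dump(readXmlnsAttribute($bodyElement));
```

Passing the element as the context node keeps the expression relative, so the same helper works for any element in the tree.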

My uneducated assumption is that internally, an HTML document is different from a regular XML document. libxml uses a different module to parse HTML, and the DOMDocument itself will have a different nodeType, as you can verify by doing

var_dump($dom->nodeType); // 13 with loadHTML, 9 with loadXml

with 13 being XML_HTML_DOCUMENT_NODE.
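
Building on that, here is a hedged sketch of how one might tell the two cases apart. The helper name wasLoadedAsHtml is invented for illustration; it only compares nodeType against the XML_HTML_DOCUMENT_NODE constant.

```php
<?php
// Hypothetical helper: true if the document node is an HTML document node
// (nodeType 13), which is what loadHTML() produces per the check above.
function wasLoadedAsHtml(DOMDocument $doc)
{
    return $doc->nodeType === XML_HTML_DOCUMENT_NODE;
}

libxml_use_internal_errors(true);   // keep parser warnings out of the output

$fromHtml = new DOMDocument();
$fromHtml->loadHTML('<html><body><p>Bar</p></body></html>');

$fromXml = new DOMDocument();
$fromXml->loadXML('<html xmlns="http://www.w3.org/1999/xhtml"><body/></html>');

var_dump(wasLoadedAsHtml($fromHtml));
var_dump(wasLoadedAsHtml($fromXml));
```

With a check like this, code that receives a document from elsewhere could choose between the DOM properties and the XPath lookup.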

Gordon
very nice and detailed, didn't know about the nodeTypes depending on the parsing method but it makes sense
edorian
Thanks for the hint about nodetypes and the ability to use xpath - solves lots of my problems!
thomasrutter