views:

1241

answers:

2

I am testing against the following test document:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
                      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"&gt;
<html xmlns="http://www.w3.org/1999/xhtml"&gt;
   <head>
        <title>hi there</title>
    </head>
    <body>
        <img class="foo" src="bar.png"/>
    </body>
</html>

If I parse the document using lxml.html, I can get the IMG with an xpath just fine:

>>> root = lxml.html.fromstring(doc)
>>> root.xpath("//img")
[<Element img at 1879e30>]

However, if I parse the document as XML and try to get the IMG tag, I get an empty result:

>>> tree = etree.parse(StringIO(doc))
>>> tree.getroot().xpath("//img")
[]

I can navigate to the element directly:

>>> tree.getroot().getchildren()[1].getchildren()[0]
<Element {http://www.w3.org/1999/xhtml}img at f56810>

But of course that doesn't help me process arbitrary documents. I would also expect to be able to query etree to get an xpath expression that will directly identify this element, which, technically I can do:

>>> tree.getpath(tree.getroot().getchildren()[1].getchildren()[0])
'/*/*[2]/*'
>>> tree.getroot().xpath('/*/*[2]/*')
[<Element {http://www.w3.org/1999/xhtml}img at fa1750>]

But that xpath is, again, obviously not useful for parsing arbitrary documents.

Obviously I am missing some key issue here, but I don't know what it is. My best guess is that it has something to do with namespaces but the only namespace defined is the default and I don't know what else I might need to consider in regards to namespaces.

So, what am I missing?

+7  A: 

The problem is the namespaces. When parsed as XML, the img tag is in the http://www.w3.org/1999/xhtml namespace since that is the default namespace for the element. You are asking for the img tag in no namespace.

Try this:

>>> tree.getroot().xpath(
...     "//xhtml:img", 
...     namespaces={'xhtml':'http://www.w3.org/1999/xhtml'}
...     )
[<Element {http://www.w3.org/1999/xhtml}img at 11a29e0>]
Ned Batchelder
Quoting from http://codespeak.net/lxml/xpathxslt.html<<Optionally, you can provide a namespaces keyword argument, which should be a dictionary mapping the namespace prefixes used in the XPath expression to namespace URIs>>
Cristian Ciupitu
+4  A: 

XPath considers all unprefixed names to be in "no namespace".

In particular the spec says:

"A QName in the node test is expanded into an expanded-name using the namespace declarations from the expression context. This is the same way expansion is done for element type names in start and end-tags except that the default namespace declared with xmlns is not used: if the QName does not have a prefix, then the namespace URI is null (this is the same way attribute names are expanded). "

See those two detailed explanations of the problem and its solution: here and here. The solution is to associate a prefix (with the API that's being used) and to use it to prefix any unprefixed name in the XPath expression.

Hope this helped.

Cheers,

Dimitre Novatchev

Dimitre Novatchev