Okay, this is starting to drive me a little bit nuts. I've tried several xml/xpath libraries for Python, and can't figure out a simple way to get a stinkin' "title" element.

The latest attempt looks like this (using Amara):

def view(req, url):
    req.content_type = 'text/plain'
    doc = amara.parse(urlopen(url))
    for node in doc.xml_xpath('//title'):
        req.write(str(node)+'\n')

But that prints out nothing. My XML looks like this: http://programanddesign.com/feed/atom/

If I try //* instead of //title it returns everything as expected. I know that the XML has titles in there, so what's the problem? Is it the namespace or something? If so, how can I fix it?


Can't seem to get it working with no prefix, but this does work:

def view(req, url):
    req.content_type = 'text/plain'
    doc = amara.parse(url, prefixes={'atom': 'http://www.w3.org/2005/Atom'})
    req.write(str(doc.xml_xpath('//atom:title')))
+1  A: 

You probably just have to take into account the namespace of the document which you're dealing with.

I'd suggest looking up how to deal with namespaces in Amara:

http://www.xml3k.org/Amara/Manual#namespaces

Edit: Using your code snippet I made some edits. I don't know what version of Amara you're using but based on the docs I tried to accommodate it as much as possible:

def view(req, url):
    req.content_type = 'text/plain'
    ns = {u'f' : u'http://www.w3.org/2005/Atom',
        u't' : u'http://purl.org/syndication/thread/1.0'}
    doc = amara.parse(urlopen(url), prefixes=ns)
    req.write(str(doc.xml_xpath(u'//f:title')))
meder
That doesn't really help me. What if I don't know the namespaces beforehand? What if I don't really care what the namespace is?
Mark
You said your xml document is similar to the one you linked to. The one you linked to contains a namespace. There's a *reason* namespaces are used - of course you can get rid of your namespace from your xml document, then you don't have to worry about it. Otherwise you must account for it.
meder
@"care what the namespace is" - you could probably parse the xmlns attribute and just register that value.
meder
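A minimal sketch of the "parse the xmlns attribute and register it" idea from the comment above. It uses the standard library's `xml.etree.ElementTree` rather than Amara, and the `SAMPLE` feed, the `collect_namespaces` helper, and the made-up `default` prefix are all assumptions for illustration, not part of the original thread:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for the Atom feed from the question.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:thr="http://purl.org/syndication/thread/1.0">
  <title>Program and Design</title>
  <entry>
    <title>First post</title>
  </entry>
</feed>"""


def collect_namespaces(source):
    """Return {prefix: uri} for every namespace declared in the document."""
    found = {}
    for _event, (prefix, uri) in ET.iterparse(source, events=("start-ns",)):
        # The default namespace is reported with an empty prefix. Queries
        # need a real prefix, so it is registered under a made-up one.
        found[prefix or "default"] = uri
    return found


namespaces = collect_namespaces(io.BytesIO(SAMPLE))
root = ET.fromstring(SAMPLE)
titles = [el.text for el in root.findall(".//default:title", namespaces)]
print(namespaces)
print(titles)
```

If a prefix is redeclared with a different URI further down the document, this sketch keeps only the last declaration, so it is only safe when each prefix means one thing throughout.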
Okay... supposing I *did* know what the namespaces were....setting them to "none" still doesn't work.
Mark
I have no idea. I just snagged a fresh copy off the ubuntu repositories, so it should be pretty recent/up to date. Got it working now... I guess I can't have no prefix, which I still think is dumb, but I guess I can work with it.
Mark
Ah. Was your final solution similar to my latest edit?
meder
Similar yes. I edited the question again, it's more or less the same.
Mark
"setting them to "none" still doesn't work." No, you need to filter them out of the XML source if you want to get rid of them.
Lennart Regebro
Right, you can't avoid dealing with namespaces unless you alter the original xml source. The whole point of namespaces I believe is to avoid possible conflicts with duplicate node names/attributes. It's in the XPath and XML standards, which I would obey if I were using those technologies.
meder
When writing XML you can define a namespace for the whole document at the top. So why can't you specify a default namespace to use when querying XML? I'm 98% sure other libraries let you do this...
Mark
@Mark: In XPath, there is no such thing as a default namespace. I guess it would be possible to fake it, though, and maybe some libraries do that.
Lennart Regebro
In .NET's XML library, you can create an `XmlNamespaceManager` that maps the empty string to one namespace and a prefix to the empty namespace. I don't know why `lxml` doesn't support this.
Robert Rossney
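For the "I don't care what the namespace is" case discussed in these comments, one option outside Amara and lxml is the namespace wildcard in the standard library's `xml.etree.ElementTree`. This is a sketch assuming Python 3.8 or later, where `{*}` is accepted in `findall` paths; the `SAMPLE` feed is a hypothetical stand-in for the one in the question:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for the Atom feed from the question.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Program and Design</title>
  <entry><title>First post</title></entry>
</feed>"""

root = ET.fromstring(SAMPLE)

# Without a namespace the query matches nothing, as in the question.
plain = root.findall(".//title")

# "{*}" matches the tag name in any namespace, or in none (Python 3.8+).
any_ns = [el.text for el in root.findall(".//{*}title")]

print(plain)
print(any_ns)
```

This is not a default namespace in the XPath sense. It simply ignores the namespace, so it will also match a `title` element from an unrelated vocabulary if the document mixes several.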
+1  A: 

It is indeed the namespaces. It was a bit tricky to find in the lxml docs, but here's how you do it:

from lxml import etree
doc = etree.parse(open('index.html'))
doc.xpath('//default:title', namespaces={'default':'http://www.w3.org/2005/Atom'})

You can also do this:

title_finder = etree.ETXPath('//{http://www.w3.org/2005/Atom}title')
title_finder(doc)

And you'll get the titles back in both cases.
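For comparison, the same two styles (a prefix map, and the `{uri}name` form) can be sketched with the standard library's `xml.etree.ElementTree`, which supports a limited subset of XPath. The `SAMPLE` feed and the `a` prefix are assumptions for illustration:

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"

# Hypothetical sample standing in for the Atom feed from the question.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Program and Design</title>
  <entry><title>First post</title></entry>
</feed>"""

root = ET.fromstring(SAMPLE)

# Style 1: pick any prefix and map it to the namespace URI.
by_prefix = root.findall(".//a:title", namespaces={"a": ATOM})

# Style 2: spell the namespace out inline as {uri}name, no prefix needed.
by_clark = root.findall(".//{%s}title" % ATOM)

print([el.text for el in by_prefix])
print(by_prefix == by_clark)
```

The prefix is only a local alias. What has to match the document is the URI, which is why `default`, `atom`, or `a` all work equally well.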

Lennart Regebro
What if I don't know the namespaces beforehand? I just want to get rid of em. They might even be defined half-way through the document (on a div or something).
Mark
why can't you just parse the xmlns attribute?
meder
XML is a fully generic data exchange protocol. If you don't know the format, you typically can't do much that is useful with the data, as you don't know what the data means. Also, if you don't know the structure beforehand, then you MUST take care of and parse namespaces wherever they appear. That, however, is a generalized XML parsing problem, and highly unlikely to be the case. So I think you do know quite a bit of the structure, including either what the namespaces are, or where they are likely to be defined. So: not a problem.
Lennart Regebro
Knowing the structure may include knowledge that the namespace is always the same and can always be safely ignored. In that case, you can filter it out from the document first. But then again, in that case you know there is a namespace there, and you do care, since you filter it out. And then you might as well just include it in the xpath methods.
Lennart Regebro
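A rough sketch of the "filter it out from the document first" approach, using the standard library's `xml.etree.ElementTree` instead of lxml. The `strip_namespaces` helper and the `SAMPLE` feed are hypothetical, and the sketch assumes it really is safe to discard every namespace in the document:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with a second namespace declared part-way down.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Program and Design</title>
  <entry xmlns:thr="http://purl.org/syndication/thread/1.0">
    <title>First post</title>
    <thr:total>3</thr:total>
  </entry>
</feed>"""


def strip_namespaces(root):
    """Rewrite every element tag from '{uri}name' to plain 'name', in place."""
    for el in root.iter():
        if el.tag.startswith("{"):
            el.tag = el.tag.split("}", 1)[1]
    return root


root = strip_namespaces(ET.fromstring(SAMPLE))

# After stripping, unprefixed queries work.
titles = [el.text for el in root.findall(".//title")]
print(root.tag)
print(titles)
```

This only rewrites element tags and leaves namespaced attribute names untouched. Elements that were distinct only by namespace become indistinguishable afterwards, which is the conflict namespaces exist to prevent.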