I'm trying to parse some HTML with XPath. Following the simplified XML example below, I want to match the string 'Text 1', then grab the contents of the relevant content node.

<doc>
    <block>
        <title>Text 1</title>
        <content>Stuff I want</content>
    </block>

    <block>
        <title>Text 2</title>
        <content>Stuff I don't want</content>
    </block>
</doc>

My Python code throws a wobbly:

>>> from lxml import etree
>>>
>>> tree = etree.XML("<doc><block><title>Text 1</title><content>Stuff I want</content></block><block><title>Text 2</title><content>Stuff I don't want</content></block></doc>")
>>>
>>> # get all titles
... tree.xpath('//title/text()')
['Text 1', 'Text 2']
>>>
>>> # match 'Text 1'
... tree.xpath('//title/text()="Text 1"')
True
>>>
>>> # Follow parent from selected nodes
... tree.xpath('//title/text()/../..//text()')
['Text 1', 'Stuff I want', 'Text 2', "Stuff I don't want"]
>>>
>>> # Follow parent from selected node
... tree.xpath('//title/text()="Text 1"/../..//text()')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 1330, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:14542)
  File "xpath.pxi", line 287, in lxml.etree.XPathElementEvaluator.__call__ (src/lxml/lxml.etree.c:90093)
  File "xpath.pxi", line 209, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:89446)
  File "xpath.pxi", line 194, in lxml.etree._XPathEvaluatorBase._raise_eval_error (src/lxml/lxml.etree.c:89281)
lxml.etree.XPathEvalError: Invalid type

Is this possible in XPath? Do I need to express what I want to do in a different way?

+3  A: 

Is this what you want?

//title[text()='Text 1']/../content/text()
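A minimal sketch of how this behaves in lxml, assuming the sample document from the question. The variable names are only illustrative. The comparison written outside square brackets evaluates to a boolean, which cannot be followed by further path steps, whereas the same comparison inside a predicate filters the title nodes and leaves a node-set to navigate from:

```python
from lxml import etree

xml = (
    "<doc>"
    "<block><title>Text 1</title><content>Stuff I want</content></block>"
    "<block><title>Text 2</title><content>Stuff I don't want</content></block>"
    "</doc>"
)
tree = etree.XML(xml)

# Outside a predicate, the comparison yields a boolean, not nodes.
is_match = tree.xpath('//title/text()="Text 1"')

# Inside a predicate, it filters the title elements, so the path can
# continue up to the parent block and down to its content.
wanted = tree.xpath("//title[text()='Text 1']/../content/text()")

print(is_match)
print(wanted)
```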
Johannes Weiß
Duh, simple really! Kinda makes sense now that I'm filtering on the title's text() node in a predicate.
Mat
You can also use //block[title='Text 1']/content to get the relevant content node.
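A quick check of this expression with lxml, again assuming the sample document from the question (variable names are illustrative). Note that it returns the content element itself rather than its text, so the text is read from the element afterwards:

```python
from lxml import etree

xml = (
    "<doc>"
    "<block><title>Text 1</title><content>Stuff I want</content></block>"
    "<block><title>Text 2</title><content>Stuff I don't want</content></block>"
    "</doc>"
)
tree = etree.XML(xml)

# Select the block whose title child equals 'Text 1', then its content child.
nodes = tree.xpath("//block[title='Text 1']/content")

# The result is a list of elements; .text gives each element's text.
content_text = [node.text for node in nodes]
print(content_text)
```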
Dror
@Dror: Now that's useful to know.
Mat
+4  A: 

Use:

string(/*/*/title[. = 'Text 1']/following-sibling::content)

This has at least two advantages over the currently accepted solution by Johannes Weiß:

  1. The very expensive abbreviation "//" (which usually causes the whole XML document to be scanned) is avoided, as it should be whenever the structure of the XML document is known in advance.

  2. There is no step back up to the parent (the location step "/.." is avoided).
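As a rough sketch in lxml, assuming the sample document from the question (variable names are illustrative): because the expression is wrapped in string(), the call returns a plain string rather than a list, and an empty string when nothing matches.

```python
from lxml import etree

xml = (
    "<doc>"
    "<block><title>Text 1</title><content>Stuff I want</content></block>"
    "<block><title>Text 2</title><content>Stuff I don't want</content></block>"
    "</doc>"
)
tree = etree.XML(xml)

# Fixed-depth path (/*/*/title) plus the following-sibling axis:
# no "//" scan and no ".." step back to the parent.
result = tree.xpath(
    "string(/*/*/title[. = 'Text 1']/following-sibling::content)"
)

# string() of an empty node-set is the empty string.
missing = tree.xpath(
    "string(/*/*/title[. = 'No such title']/following-sibling::content)"
)

print(repr(result))
print(repr(missing))
```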

Dimitre Novatchev
Fair improvement. My actual document is HTML, and the 'title' part is nested about five levels deep, so I have to go back up about five parents to get to the 'content' area. I'll bear the first point in mind, though it'll make little difference for a dirty hack.
Mat
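For the deeper nesting described in the last comment, one possible approach is to avoid counting parents at all. The sketch below uses hypothetical markup (the wrapper elements a, b and c are invented for illustration, not taken from the real page) and shows two standard XPath 1.0 options: the ancestor axis, or a predicate on the block that looks for a matching title anywhere beneath it.

```python
from lxml import etree

# Hypothetical markup: the title sits several levels below its block.
xml = (
    "<doc>"
    "<block><a><b><c><title>Text 1</title></c></b></a>"
    "<content>Stuff I want</content></block>"
    "<block><a><b><c><title>Text 2</title></c></b></a>"
    "<content>Stuff I don't want</content></block>"
    "</doc>"
)
tree = etree.XML(xml)

# Option 1: climb to the nearest enclosing block via the ancestor axis.
by_ancestor = tree.xpath(
    "//title[. = 'Text 1']/ancestor::block[1]/content/text()"
)

# Option 2: select the block that contains a matching title at any depth.
by_predicate = tree.xpath(
    "//block[.//title = 'Text 1']/content/text()"
)

print(by_ancestor)
print(by_predicate)
```

Either form keeps working if the number of intermediate levels changes, which a chain of "/.." steps would not.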