Hi all, I am having a problem that I think comes down to XPath. I am using the html module from the lxml package to try to get at some data. The example below is the most simplified version of the situation; keep in mind that the HTML I am actually working with is much uglier.

<table>
    <tr>
    <td>
        <table>
            <tr><td></td></tr>
            <tr><td>
                <table>
                    <tr><td><u><b>Header1</b></u></td></tr> 
                    <tr><td>Data</td></tr>
                </table>
            </td></tr>
        </table>
     </td></tr>
</table>

What I really want is the deeply nested table, because it has the header text "Header1". I am trying like so:

from lxml import html
page = '...'
tree = html.fromstring(page)
print(tree.xpath('//table[//*[contains(text(), "Header1")]]'))

but that gives me all of the table elements. I just want the one table that contains this text. I understand what is going on but am having a hard time figuring out how to do this besides breaking out some nasty regex. Any thoughts?

A: 

Perhaps this would work for you:

tree.xpath("//table[not(descendant::table)]/*[contains(., 'Header1')]")

The not(descendant::table) bit ensures that you're getting the innermost table.
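
A quick way to see what this expression selects is to run it against the sample from the question. The sketch below assumes the simplified HTML is representative of the real page; the variable names are only illustrative. Because the last step is `/*`, the result is the matching child of the innermost table (a `tr` here) rather than the table itself. Moving the `contains()` test into a second predicate on `table` is one possible variation, not part of the answer as posted, that returns the table.

```python
from lxml import html

# Simplified sample from the question (the real page is assumed to be messier).
page = """
<table>
    <tr>
    <td>
        <table>
            <tr><td></td></tr>
            <tr><td>
                <table>
                    <tr><td><u><b>Header1</b></u></td></tr>
                    <tr><td>Data</td></tr>
                </table>
            </td></tr>
        </table>
     </td></tr>
</table>
"""

tree = html.fromstring(page)

# The expression as posted: the trailing /* step selects a child of the table.
children = tree.xpath("//table[not(descendant::table)]/*[contains(., 'Header1')]")
print([el.tag for el in children])

# Variation (an assumption, not from the answer): test the table itself.
tables = tree.xpath("//table[not(descendant::table)][contains(., 'Header1')]")
print([el.tag for el in tables])
```

If the child row is what you get back, calling `getparent()` on it is another way to reach the table.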

Michał Marczyk
A: 
table, = tree.xpath('//*[.="Header1"]/ancestor::table[1]')
  • //*[.="Header1"] selects every element, anywhere in the document, whose string value is exactly Header1.
  • ancestor::table[1] selects the nearest table ancestor of each such element.

Complete example

#!/usr/bin/env python
from lxml import html

page = """
<table>
    <tr>
    <td>
        <table>
            <tr><td></td></tr>
            <tr><td>
                <table>
                    <tr><td><u><b>Header1</b></u></td></tr> 
                    <tr><td>Data</td></tr>
                </table>
            </td></tr>
        </table>
     </td></tr>
</table>
"""

tree = html.fromstring(page)
table, = tree.xpath('//*[.="Header1"]/ancestor::table[1]')
print(html.tostring(table))
J.F. Sebastian
While this is correct for the example, I think it is too generic to use `//*[.="Header1"]`. There could be a `see <i>Header1</i>` somewhere in the input and your expression would match the `<i>`.
Tomalak
@Tomalak: The expression always returns a `<table>` element. It doesn't matter which element contains `"Header1"`, as long as it is somewhere inside the `<table>` element.
J.F. Sebastian
Right, no argument there. Still, my point is that you might not be matching the *table header* as such, but anything generic that by chance contains the text `'Header1'`. So chances are you match the wrong table.
Tomalak
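
The concern raised in these comments can be reproduced with a small variation of the input. The HTML below is hypothetical (the `id` attributes and the "notes" table are made up for illustration): one table only mentions Header1 in passing, the other uses it as a real header. A minimal sketch:

```python
from lxml import html

# Hypothetical input: the first table merely mentions "Header1",
# the second one uses it as an actual header.
page = """
<div>
  <table id="notes"><tr><td>see <i>Header1</i></td></tr></table>
  <table id="data">
    <tr><td><u><b>Header1</b></u></td></tr>
    <tr><td>Data</td></tr>
  </table>
</div>
"""

tree = html.fromstring(page)

# Any element whose string value is exactly "Header1" counts, including <i>.
tables = tree.xpath('//*[.="Header1"]/ancestor::table[1]')
ids = sorted(t.get("id") for t in tables)
print(ids)
```

Both tables come back in this case, so the single-element unpacking `table, = ...` from the answer would raise a ValueError on such input.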
+1  A: 

Find the header you are interested in and then pull out its table.

//u[b = 'Header1']/ancestor::table[1]

or

//td[not(.//table) and .//b = 'Header1']/ancestor::table[1]
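
As a rough check, both expressions can be run against the sample from the question. This sketch assumes the simplified HTML is representative; `first` and `second` are illustrative names only.

```python
from lxml import html

# Simplified sample from the question.
page = """
<table>
    <tr>
    <td>
        <table>
            <tr><td></td></tr>
            <tr><td>
                <table>
                    <tr><td><u><b>Header1</b></u></td></tr>
                    <tr><td>Data</td></tr>
                </table>
            </td></tr>
        </table>
     </td></tr>
</table>
"""

tree = html.fromstring(page)
root = tree.getroottree()

# Start from the header markup, then walk up to the nearest table.
first = tree.xpath("//u[b = 'Header1']/ancestor::table[1]")

# Start from a cell that holds the header and no nested table.
second = tree.xpath("//td[not(.//table) and .//b = 'Header1']/ancestor::table[1]")

# Compare the location of each result inside the parsed tree.
print(root.getpath(first[0]) == root.getpath(second[0]))
```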

Note that a path beginning with // always starts at the document root (!), even when it appears inside a predicate. You can't do:

//table[//*[contains(text(), "Header1")]]

and expect the inner predicate (//*…) to magically start at the right context. Use .// to start at the context node. Even then, this:

//table[.//*[contains(text(), "Header1")]]

won't work, since even the outermost table contains the text 'Header1' somewhere deep down, so the predicate evaluates to true for every table in your example. Use not(.//table), as I did above, to make sure the table you select has no other tables nested inside it.
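
The difference between the three predicates is easy to see by counting matches on the sample from the question. This is a minimal sketch; the third expression is one way of combining the pieces described here, written for illustration rather than taken verbatim from above.

```python
from lxml import html

# Simplified sample from the question.
page = """
<table>
    <tr>
    <td>
        <table>
            <tr><td></td></tr>
            <tr><td>
                <table>
                    <tr><td><u><b>Header1</b></u></td></tr>
                    <tr><td>Data</td></tr>
                </table>
            </td></tr>
        </table>
     </td></tr>
</table>
"""

tree = html.fromstring(page)

# Inner // restarts at the document root, so the predicate is true everywhere.
absolute = tree.xpath('//table[//*[contains(text(), "Header1")]]')

# .// is relative, but every enclosing table still has the header below it.
relative = tree.xpath('//table[.//*[contains(text(), "Header1")]]')

# Excluding tables that contain other tables leaves only the innermost one.
innermost = tree.xpath(
    '//table[not(.//table) and .//*[contains(text(), "Header1")]]'
)

print(len(absolute), len(relative), len(innermost))
```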

Also, avoid testing the condition on every element with .//*, since it can only hold for a few of them to begin with. It's more efficient to name the element you actually expect.

Tomalak
+1  A: 

Use:

//td[text() = 'Header1']/ancestor::table[1]
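
One caveat, shown here as a sketch against the simplified sample from the question: `text()` only looks at text nodes that are direct children of the `td`. In the sample the header text sits inside `<u><b>`, so the expression above finds nothing there, while comparing the string value of the cell (`.`) does match. Whether the real page has the text directly in the `td` is an assumption you would need to check.

```python
from lxml import html

# Simplified sample from the question.
page = """
<table>
    <tr>
    <td>
        <table>
            <tr><td></td></tr>
            <tr><td>
                <table>
                    <tr><td><u><b>Header1</b></u></td></tr>
                    <tr><td>Data</td></tr>
                </table>
            </td></tr>
        </table>
     </td></tr>
</table>
"""

tree = html.fromstring(page)

# text() tests only the td's own text nodes, not text nested in <u><b>.
direct_text = tree.xpath("//td[text() = 'Header1']/ancestor::table[1]")

# "." compares the whole string value of the td, nested markup included.
string_value = tree.xpath("//td[. = 'Header1']/ancestor::table[1]")

print(len(direct_text), len(string_value))
```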
Dimitre Novatchev
Accepting this. But would you happen to know why under certain configurations (like my production environment) this returns an array with 3 tables, all the same table? It's giving me the table 3 times for some reason. Same python version, same version of lxml, same script, same test data...
Dan.StackOverflow
@Dan.StackOverflow: This means that in this table there are three `td` elements with a text node with value "Header1".
Dimitre Novatchev