ansaurus

Question

extract specific element from nested elements using lxml html

Answer 1

Perhaps this would work for you:

tree.xpath("//table[not(descendant::table)]/*[contains(., 'Header1')]")

The not(descendant::table) bit ensures that you're getting the innermost table.
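
A minimal sketch of how this expression behaves, assuming three levels of nested tables. The markup and the id attributes are hypothetical, added only so the result is easy to identify. Note that the expression returns the matching child of the innermost table rather than the table itself, so the sketch steps up with getparent().

```python
from lxml import html

# Hypothetical markup: three nested tables, labelled by id for illustration.
page = """
<table id="outer"><tr><td>
  <table id="middle"><tr><td>
    <table id="inner">
      <tr><td><u><b>Header1</b></u></td></tr>
      <tr><td>Data</td></tr>
    </table>
  </td></tr></table>
</td></tr></table>
"""

tree = html.fromstring(page)

# Tables that contain no other table, i.e. the innermost ones.
innermost = tree.xpath("//table[not(descendant::table)]")

# Children of those tables whose string value contains 'Header1'.
matches = tree.xpath("//table[not(descendant::table)]/*[contains(., 'Header1')]")

# Step up from the matching child to the table that holds it.
table = matches[0].getparent()
print(table.get("id"))
```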

Michał Marczyk 2010-04-14 05:48:14

Answer 2

table, = tree.xpath('//*[.="Header1"]/ancestor::table[1]')

//*[.="Header1"] selects every element, anywhere in the document, whose string value is exactly Header1.
ancestor::table[1] selects the nearest ancestor of that element that is a table.

Complete example

#!/usr/bin/env python
from lxml import html

page = """
<table>
    <tr>
    <td>
        <table>
            <tr><td></td></tr>
            <tr><td>
                <table>
                    <tr><td><u><b>Header1</b></u></td></tr> 
                    <tr><td>Data</td></tr>
                </table>
            </td></tr>
        </table>
     </td></tr>
</table>
"""

tree = html.fromstring(page)
table, = tree.xpath('//*[.="Header1"]/ancestor::table[1]')
print(html.tostring(table, encoding="unicode"))  # encoding="unicode" returns text rather than bytes

J.F. Sebastian 2010-04-14 06:05:19

While this is correct for the example, I think it is too generic to use `//*[.="Header1"]`. There could be a `see <i>Header1</i>` somewhere in the input and your expression would match the `<i>`.

Tomalak 2010-04-14 08:58:45

@Tomalak: It always matches a *`<table>`* element. It doesn't matter which element contains `"Header1"` as long as it is somewhere inside the `<table>` element.

J.F. Sebastian 2010-04-14 10:04:52

Right, no argument there. Still, my point is that you might not be matching the *table header* as such, but anything generic that by chance contains the text `'Header1'`. So chances are you match the wrong table.

Tomalak 2010-04-14 10:53:03

Answer 3

Find the header you are interested in and then pull out its table.

//u[b = 'Header1']/ancestor::table[1]

or

//td[not(.//table) and .//b = 'Header1']/ancestor::table[1]

Note that a leading // always starts at the document root (!), even inside a predicate. You can't do:

//table[//*[contains(text(), "Header1")]]

and expect the inner predicate (//*…) to magically start at the right context. Use .// to start at the context node. Even then, this:

//table[.//*[contains(text(), "Header1")]]

won't work since even the outermost table contains the text 'Header1' somewhere deep down, so the predicate evaluates to true for every table in your example. Use not() like I did to make sure no other tables are nested.

Also, don't test the condition on every node with .//*, since it can't be true for most of them to begin with. It's more efficient to name the element you expect.
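
A rough sketch of the difference between these expressions, run through lxml. The nested markup and the id attributes are hypothetical and exist only to make the results readable; ids() is a helper introduced for this illustration.

```python
from lxml import html

# Hypothetical markup: three nested tables, labelled by id for illustration.
page = """
<table id="outer"><tr><td>
  <table id="middle"><tr><td>
    <table id="inner">
      <tr><td><u><b>Header1</b></u></td></tr>
      <tr><td>Data</td></tr>
    </table>
  </td></tr></table>
</td></tr></table>
"""

tree = html.fromstring(page)


def ids(expr):
    """Illustrative helper: ids of the tables an XPath expression selects."""
    return [t.get("id") for t in tree.xpath(expr)]


# Inner // restarts at the document root, so the predicate holds for every table.
absolute = ids('//table[//*[contains(text(), "Header1")]]')

# .// starts at each table, but every enclosing table still contains the text.
relative = ids('//table[.//*[contains(text(), "Header1")]]')

# Restricting to a td with no nested table pins down the innermost table.
innermost = ids('//td[not(.//table) and .//b = "Header1"]/ancestor::table[1]')

print(absolute, relative, innermost)
```

On this markup the first two expressions select all three tables, while the last one selects only the innermost.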

Tomalak 2010-04-14 08:47:30

Answer 4

Use:

//td[text() = 'Header1']/ancestor::table[1]
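
One caveat worth illustrating: text() only looks at text nodes that are direct children of the td, so this expression will not match a cell whose text is wrapped in another element. The sketch below assumes two hypothetical tables (the ids are made up for illustration) and compares text() with the string value (.).

```python
from lxml import html

# Hypothetical markup: one cell holds the text directly, the other wraps it in <b>.
page = """
<div>
  <table id="plain"><tr><td>Header1</td></tr></table>
  <table id="wrapped"><tr><td><b>Header1</b></td></tr></table>
</div>
"""

doc = html.fromstring(page)


def ids(expr):
    """Illustrative helper: ids of the tables an XPath expression selects."""
    return [t.get("id") for t in doc.xpath(expr)]


# text() compares only the td's own text nodes.
by_text_node = ids("//td[text() = 'Header1']/ancestor::table[1]")

# . compares the td's full string value, including text in descendants.
by_string_value = ids("//td[. = 'Header1']/ancestor::table[1]")

print(by_text_node, by_string_value)
```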

Dimitre Novatchev 2010-04-14 13:04:24

Accepting this. But would you happen to know why under certain configurations (like my production environment) this returns an array with 3 tables, all the same table? It's giving me the table 3 times for some reason. Same python version, same version of lxml, same script, same test data...

Dan.StackOverflow 2010-04-18 05:36:23

@Dan.StackOverflow: This means that in this table there are three `td` elements with a text node with value "Header1".

Dimitre Novatchev 2010-04-18 14:12:02
