There must be an easier way to do this. I need to extract some text from a large number of HTML documents. In my tests, the most reliable way to find it is to look for a specific word in the text_content of the div elements. To inspect a specific element above the one that contains my text, I have been enumerating my list of div elements, taking the index of the one that contains my text, and then subtracting from that index to reach an earlier element. I am sure there must be a better way, but I can't figure it out.

In case that is not clear:

for index, element in enumerate(list_of_elements):
    if 'the string' in element.text_content():
        thelocation = index

the_other_text = list_of_elements[thelocation - 9].text_content()

or

theitem.getprevious().getprevious().getprevious().getprevious().getprevious().getprevious().getprevious().getprevious().getprevious().text_content()
A: 

Use something like simplehtmldom, and then provide an index?

Amber
+1  A: 

Does this do the trick?

from itertools import islice
ancestor = islice(theitem.iterancestors(), 4) # To get the fourth ancestor

EDIT: I'm an idiot; that doesn't do the trick. You'll need to wrap it up in a helper function like so:

def nthparent(element, n):
    parent = islice(element.iterancestors(), n, n+1)
    return parent[0] if parent else None

ancestor = nthparent(theitem, 4) # to get the 4th parent
Will McCutchen
I am playing with ancestor right now, trying to figure out how to manipulate the objects in it. I see that I get four ancestors. Thanks.
PyNEwbie
@PyNEwbie: See my edited answer. The code I gave you initially didn't do what you needed it to do.
Will McCutchen
Thanks, I understand more now, and this is helpful.
PyNEwbie
`islice` returns an iterator, so you should write `next(islice(..), None)` instead of `parent[0] ..`
J.F. Sebastian
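To illustrate the point of that comment, here is a minimal sketch of a helper that pulls one item out of an `islice` with `next`. The name `nth` and the 1-based counting are assumptions made for this example, not part of the original answer, and a plain list stands in for `theitem.iterancestors()` so the snippet runs without lxml.

```python
from itertools import islice

def nth(iterable, n, default=None):
    """Return the n-th item (1-based) of an iterable, or default if it is too short."""
    # islice returns an iterator, which cannot be indexed; next() takes its single item.
    return next(islice(iterable, n - 1, n), default)

# Hypothetical stand-in for theitem.iterancestors(), which yields the parent first.
ancestors = ["parent", "grandparent", "great-grandparent"]

second = nth(iter(ancestors), 2)   # the second ancestor
missing = nth(iter(ancestors), 9)  # too few ancestors, so the default is returned
```

With lxml, the call would presumably be `nth(theitem.iterancestors(), 4)` to reach the fourth ancestor.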
+3  A: 

lxml supports XPath:

from lxml import etree
root = etree.fromstring("...your xml...")

el, = root.xpath("//div[text() = 'the string']/preceding-sibling::*[9]")
J.F. Sebastian
But I am a beginner; how is this any better for me? Also, I am using HTML. I started with `mytree = fromstring(thedocument)` and then `list_of_elements = mytree.cssselect('div')`.
PyNEwbie
@PyNEwbie: The above XPath expression is just an example; in your case it should be something like `elements[-1].xpath("preceding-sibling::div[9]")`.
J.F. Sebastian
I've added a combined XPath expression.
J.F. Sebastian
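For the HTML case raised in these comments, the XPath approach might look like the sketch below. The sample markup and the offset of 2 are made up for illustration, and `contains(., ...)` is swapped in for the exact `text() = ...` test so that it behaves like the substring check in the question.

```python
from lxml import html

# Made-up sample document; the real input would be one of the HTML files.
doc = html.fromstring("""
<html><body>
  <div>first</div>
  <div>second</div>
  <div>third</div>
  <div>here is the string</div>
</body></html>
""")

# contains(., ...) tests a div's full text content, much like
# 'the string' in div.text_content() in the question.
matches = doc.xpath("//div[contains(., 'the string')]")
target = matches[-1]  # keep the last match, as the original loop does

# preceding-sibling::div[2] counts backwards from the target;
# [1] would be the div immediately before it.
earlier = target.xpath("preceding-sibling::div[2]")
the_other_text = earlier[0].text_content() if earlier else None
```

Note that `preceding-sibling::div[n]` counts only div siblings, whereas a chain of `getprevious()` calls counts every sibling element; `preceding-sibling::*[n]` would be the closer match for the latter.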