views:

602

answers:

1

I'm trying to take a string of text, and "extract" the rest of the text in the paragraph/document from the html.

My current is approach is trying to find the "parent tag" of the string in the html that has been parsed with lxml. (if you know of a better way to tackle this problem, I'm all ears!)

For example, search the tree for "TEXT STRING HERE" and return the "p" tag. (note that I won't know the exact layout of the html beforehand)

<html>
<head>
...
</head>
<body>
.... 
<div>
...
<p>TEXT STRING HERE ......</p>
...
</html>

Thanks for your help!

+1  A: 

This is a simple way to do it with ElementTree. It does require that your HTML input is valid XML (so I have added the appropriate end tags to your HTML):

import elementtree.ElementTree as ET

html = """<html>
<head>
</head>
<body>
<div>
<p>TEXT STRING HERE ......</p> 
</div>
</body>
</html>"""

for e in ET.fromstring(html).getiterator():
    if e.text.find('TEXT STRING HERE') != -1:
        print "Found string %r, element = %r" % (e.text, e)
mhawke