ansaurus

Question

Once I have identified the beginning and end of a section of an HTML document using lxml, how do I get everything between them?

Answer 1

+1 A:

I'd suggest using SAX for this task.

Basic docs are available at http://codespeak.net/lxml/sax.html#producing-sax-events-from-an-elementtree-or-element

Your handler should consume events without taking any action until it receives the bolded item you need. From then on it writes events into a new buffer, tree, or whatever you prefer, until it receives the terminating bolded item.
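A minimal sketch of such a handler, to make the idea concrete. The class name BetweenBoldMarkers, the 'KEY' marker default, and the sample markup are all hypothetical, and the sketch assumes the markers are non-nested <b> elements. It ignores text until the first matching bold item closes, then buffers text until the second one closes.

```python
from xml.sax.handler import ContentHandler

from lxml import html
from lxml.sax import saxify


class BetweenBoldMarkers(ContentHandler):
    """Buffer the text found between the first two <b> elements containing a marker."""

    def __init__(self, marker="KEY"):
        super().__init__()
        self.marker = marker      # substring that identifies a boundary <b>
        self.in_bold = False      # True while inside any <b> element
        self.bold_text = []       # text of the <b> currently being read
        self.markers_seen = 0     # number of boundary <b> elements closed so far
        self.captured = []        # text collected between the two boundaries

    def startElementNS(self, name, qname, attrs):
        # name is a (namespace, localname) tuple
        if name[1] == "b":
            self.in_bold = True
            self.bold_text = []

    def characters(self, content):
        if self.in_bold:
            # Hold bold text back until we know whether it is a boundary.
            self.bold_text.append(content)
        elif self.markers_seen == 1:
            self.captured.append(content)

    def endElementNS(self, name, qname):
        if name[1] != "b":
            return
        self.in_bold = False
        text = "".join(self.bold_text)
        if self.marker in text:
            self.markers_seen += 1
        elif self.markers_seen == 1:
            # An ordinary bold item inside the section: keep its text.
            self.captured.append(text)


# Hypothetical sample document, for illustration only.
sample = (
    "<html><body><b>KEY one</b><p>alpha <i>beta</i> gamma</p>"
    "<b>KEY two</b><p>delta</p></body></html>"
)
handler = BetweenBoldMarkers()
saxify(html.fromstring(sample), handler)
between = "".join(handler.captured)
```

Because lxml emits the events in document order, the buffered text comes out in reading order without building a list of elements first.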

Daniel Kluev 2010-08-17 04:40:55

Thanks for your trouble. I did look at SAX and decided that I didn't want to climb that hill yet, though it looks very useful. I am a beginner.

PyNEwbie 2010-08-18 21:18:43

Answer 2

A:

In the spirit of SO I have figured out what I think is the best answer and am going to post it myself.

from lxml import html

# Parse the document.
with open(r'c:\temp\testlxml.htm') as f:
    testFile = f.read()
aTree = html.fromstring(testFile)
bolds = aTree.cssselect('b')

# Bold elements whose text contains 'KEY' mark the section boundaries.
theBoldItems = [item for item in bolds if item.text and 'KEY' in item.text]
theTitles = [item.text for item in theBoldItems]

# Every element in the tree, in document order.
theFullList = list(aTree.iter())

# Positions of the first two boundary elements in that list.
first = theFullList.index(theBoldItems[0])
second = theFullList.index(theBoldItems[1])

# Collect the text and tail of each element from the first boundary up to the second.
theText = []
for item in theFullList[first:second]:
    if item.text:
        theText.append(item.text)
    if item.tail:
        theText.append(item.tail)

aString = ' '.join(theText)

A little bit of explanation.

My goal is to apply some logic to the bolded parts of the documents, because the bolded sections that contain the word KEY define the different sections of the document. theTitles is a list of the text of the bolded elements that include the word 'KEY'. Depending on my particular needs I might want all of the text between any two items from theTitles, and I can create tests and the necessary logic to select items from theTitles.

theBoldItems is a list of the actual elements, so for any i, theTitles[i]==theBoldItems[i].text

Next I get theFullList, which is all of the HTML elements in the tree. Because lxml builds the tree in document order, I know that I want to capture all of the elements between theBoldItems[i] and theBoldItems[i+1]. The nice thing is that elements can be compared directly with ==, so finding their positions in the list is that easy.

I can now get the text for all of those elements, and while I still need to clean it up some, I have successfully ripped out all of the text between any two items I might want.
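For reuse with any pair of boundaries, the same idea can be wrapped in a function. This is only a sketch: text_between is a hypothetical helper name and the sample markup is made up. It swaps the list slice for etree.iterwalk, on the assumption that you want each element's text and tail in reading order and want to leave out the boundary titles themselves.

```python
from lxml import etree, html


def text_between(tree, start, end):
    """Return the text that follows element `start` and precedes element `end`."""
    pieces = []
    capturing = False
    for event, el in etree.iterwalk(tree, events=("start", "end")):
        if el is end:
            # Reached the opening of the end boundary: stop.
            break
        if event == "start":
            # An element opens: its .text comes next in reading order.
            if capturing and el.text:
                pieces.append(el.text)
        else:
            # An element closes: its .tail comes next in reading order.
            if el is start:
                capturing = True
            if capturing and el.tail:
                pieces.append(el.tail)
    return "".join(pieces)


# Hypothetical sample document, for illustration only.
sample = (
    "<html><body><b>KEY one</b><p>alpha <i>beta</i> gamma</p>"
    "<b>KEY two</b><p>delta</p></body></html>"
)
tree = html.fromstring(sample)
key_bolds = [b for b in tree.iter("b") if b.text and "KEY" in b.text]
section = text_between(tree, key_bolds[0], key_bolds[1])
```

Calling text_between(tree, key_bolds[i], key_bolds[i + 1]) for each i would give one string per section.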

PyNEwbie 2010-08-18 21:13:37
