I am working with some HTML files and trying to find a way to consistently extract some text from them. I know that the section I want begins with some bolded words and ends with other bolded words.

bolded_items=atree.cssselect('b')

myKeys=[item for item in bolded_items if item.text and 'KEY' in item.text]

So myKeys is a list of elements from atree: specifically, the bold elements whose text contains the word 'KEY'.

I now want to identify all of the parts of the tree between any two elements in myKeys so that I can manipulate them in various ways. I have been playing around with getparent, getchildren, getnext and the other methods that looked promising after running dir(myKeys[0]), but I am not making progress.

Any suggestions would be appreciated.

+1  A: 

I'd suggest using SAX for this task.

Basic docs are available at http://codespeak.net/lxml/sax.html#producing-sax-events-from-an-elementtree-or-element

Your handler should consume events without acting on them until it receives the bolded item you need, and then write events into a new buffer, tree, or whatever suits you until it receives the terminating bolded item.
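To make the idea concrete, here is a minimal sketch of such a handler. It is only an illustration under some assumptions: it uses the standard library's xml.sax on a well-formed snippet instead of lxml.sax on a real HTML tree, and the names BetweenKeysHandler, text_between_keys and SAMPLE are made up for the example.

```python
import xml.sax


class BetweenKeysHandler(xml.sax.ContentHandler):
    """Collect character data between the first two <b> elements containing the marker."""

    def __init__(self, marker="KEY"):
        super().__init__()
        self.marker = marker
        self.in_bold = False   # True while inside a <b> element
        self.bold_text = []    # text of the <b> element being read
        self.keys_seen = 0     # number of marker <b> elements closed so far
        self.collected = []    # text captured between the two keys

    def startElement(self, name, attrs):
        if name == "b":
            self.in_bold = True
            self.bold_text = []

    def characters(self, content):
        if self.in_bold:
            # Hold bold text back until we know whether it is a key.
            self.bold_text.append(content)
        elif self.keys_seen == 1:
            self.collected.append(content)

    def endElement(self, name):
        if name != "b":
            return
        self.in_bold = False
        text = "".join(self.bold_text)
        if self.marker in text:
            self.keys_seen += 1
        elif self.keys_seen == 1:
            # Ordinary bold text inside the section is kept.
            self.collected.append(text)


def text_between_keys(markup):
    """Return the whitespace-normalized text between the first two keys."""
    handler = BetweenKeysHandler()
    xml.sax.parseString(markup.encode("utf-8"), handler)
    return " ".join("".join(handler.collected).split())


# Hypothetical sample document (assumed to be well-formed markup).
SAMPLE = """<html><body>
<p><b>KEY ONE</b> first <i>part</i></p>
<p>second <b>bold</b> part</p>
<p><b>KEY TWO</b> after</p>
</body></html>"""
```

Calling text_between_keys(SAMPLE) gives 'first part second bold part'. Real-world HTML is often not well-formed, so in practice you would probably feed the handler from an lxml tree as described in the docs linked above.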

Daniel Kluev
Thanks for your trouble. I did look at SAX and decided that I didn't want to climb that hill yet, though it looks very useful. I am a beginner.
PyNEwbie
A: 

In the spirit of SO I have figured out what I think is the best answer and am going to post it myself.

from lxml import html

# Parse the test file into an element tree.
with open(r'c:\temp\testlxml.htm') as f:
    aTree = html.fromstring(f.read())

# All <b> elements, then only those whose text contains 'KEY'.
bolds = aTree.cssselect('b')
theBoldItems = [item for item in bolds if item.text and 'KEY' in item.text]
theTitles = [item.text for item in theBoldItems]

# Every element of the tree, in document order.
theFullList = list(aTree.iter())

# Find where the first two KEY elements sit in that list.
for numb, item in enumerate(theFullList):
    if item == theBoldItems[0]:
        first = numb
    if item == theBoldItems[1]:
        second = numb

# Collect the text and tail of everything from the first key up to the second.
theText = []
for item in theFullList[first:second]:
    if item.text:
        theText.append(item.text)
    if item.tail:
        theText.append(item.tail)

aString = ' '.join(theText)

A little bit of explanation.

My goal is to apply some logic to the bolded parts of the documents, because the bolded sections that contain the word KEY define the different sections of the document. theTitles is a list of the text of the bolded elements that contain the word 'KEY'. Depending on my particular needs I might want all of the text between any two items from theTitles; I can create tests and the necessary logic to select items from theTitles.

theBoldItems is a list of the actual elements; for any i, theTitles[i]==theBoldItems[i].text.

Next I get theFullList, which is all of the HTML elements in the tree. Because lxml builds the tree in document order, I know that I want to capture all of the elements between theBoldItems[i] and theBoldItems[i+1]. And the nice thing is that, the way Python is built, the test is that easy.

I can now get the text for all of those things and while I still need to clean it up some I have successfully ripped out all of the text between any two items I might want.
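For any pair of keys, the same steps can be wrapped in a small helper. This is just a sketch under assumptions: to keep it self-contained it uses the standard library's xml.etree.ElementTree (whose iter(), text and tail behave like their lxml counterparts) on a well-formed snippet, and text_between and SAMPLE are names invented for the example.

```python
import xml.etree.ElementTree as ET


def text_between(root, start, end):
    """Join the text and tail of every element from start up to (not including) end."""
    elements = list(root.iter())      # all elements in document order
    first = elements.index(start)     # position of the opening key
    second = elements.index(end)      # position of the closing key
    pieces = []
    for element in elements[first:second]:
        if element.text:
            pieces.append(element.text)
        if element.tail:
            pieces.append(element.tail)
    # Normalize runs of whitespace so the result reads as one line.
    return " ".join(" ".join(pieces).split())


# Hypothetical sample document (assumed to be well-formed markup).
SAMPLE = """<html><body>
<b>KEY ONE</b> first part
<p>second <i>italic</i> part</p>
<b>KEY TWO</b> after
</body></html>"""

root = ET.fromstring(SAMPLE)
bold_keys = [b for b in root.iter("b") if b.text and "KEY" in b.text]
section = text_between(root, bold_keys[0], bold_keys[1])
```

Here section is 'KEY ONE first part second italic part'; the title itself is included because the slice starts at the first key. One thing to watch: a tail is the text after an element's closing tag, so a container that opens before the second key but closes after it will contribute text from beyond that key.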

PyNEwbie