I have been trying to strip out some data from HTML files. I have the logic coded to get the right cells, but now I am struggling to get the actual contents of a cell.

Here is my HTML snippet:

headerRows[0][10].contents

  [<font size="+0"><font face="serif" size="1"><b>Apples Produced</b><font size="3">       
  </font></font></font>]

Note that this is a Python list (hence the []) containing a single item.

I need the value Apples Produced but can't get to it.

Any suggestions would be appreciated.

Suggestions on a good book that explains this would earn my eternal gratitude.

+5  A: 

The BeautifulSoup documentation should cover everything you need - in this case it looks like you want to use findNext:

headerRows[0][10].findNext('b').string

A more generic solution which doesn't rely on the <b> tag would be to use the text argument to findAll, which allows you to search only for NavigableString objects:

>>> soup = BeautifulSoup(u'<p>Test 1 <span>More</span> Test 2</p>')
>>> u''.join(soup.findAll(text=True))
u'Test 1 More Test 2'
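
The same idea (collect only the text nodes and ignore whichever tags wrap them) can also be sketched without BeautifulSoup. This is only an illustration, assuming Python 3 and its standard-library html.parser module; TextCollector and cell_text are hypothetical names, and the markup is the cell from the question:

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collects every text node and ignores all tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called once for each run of text between tags
        self.chunks.append(data)

def cell_text(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    # Join the text nodes, then collapse runs of whitespace
    return " ".join("".join(parser.chunks).split())

bold = ('<font size="+0"><font face="serif" size="1"><b>Apples Produced</b>'
        '<font size="3">       \n  </font></font></font>')
# Same cell, but with <I> instead of <b>
italic = bold.replace("<b>", "<I>").replace("</b>", "</I>")

print(cell_text(bold))
print(cell_text(italic))
```

Because nothing here looks for a particular tag name, both variants of the cell give the same text.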
insin
+1  A: 

Thanks for that answer. However, isn't there a more general answer? What happens if my cell doesn't have a bold tag?

Say it is:

 [<font size="+0"><font face="serif" size="1"><I>Apples Produced</I><font size="3">       
  </font></font></font>]

Apples Produced

I am trying to learn to read and understand the documentation, and your response will help.

PyNEwbie
A: 

I have a base class, which I use to extend all the Beautiful Soup classes, with a bunch of methods that help me get at the text within a group of elements whose structure I don't necessarily want to rely on. One of those methods is the following:

  import re

  def clean(self, val):
    # coerce non-string input (e.g. a Tag) to its string form
    if not isinstance(val, str): val = str(val)
    val = re.sub(r'<.*?>', '', val) # remove tags
    val = re.sub(r'\s+', ' ', val) # collapse internal whitespace
    return val.strip() # remove leading & trailing whitespace
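
As a rough usage sketch, here is a standalone version of that method (no self, so it runs outside the base class, which is not shown here) applied to the italic cell from the question. It assumes Python 3, and the cell variable is just the question's markup pasted in as a string:

```python
import re

def clean(val):
    # Hypothetical standalone variant of the method above (no self)
    if not isinstance(val, str):
        val = str(val)
    val = re.sub(r"<.*?>", "", val)  # remove tags
    val = re.sub(r"\s+", " ", val)   # collapse internal whitespace
    return val.strip()               # remove leading & trailing whitespace

cell = ('<font size="+0"><font face="serif" size="1"><I>Apples Produced</I>'
        '<font size="3">       \n  </font></font></font>')
print(clean(cell))
```

One caveat with the regex approach: the pattern treats the first > after a < as the end of the tag, so a tag with a > inside an attribute value, such as title="a>b", is not stripped cleanly. For markup like the cell above that is not a problem.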
ThePants
A: 

I really appreciate this help. The best thing about these answers is that it is a lot easier to generalize from them than it has been from the BeautifulSoup documentation. I learned to program in the Fortran era, and while I am enjoying learning Python and am amazed at its power (BeautifulSoup is an example), making a coherent whole of the documentation is tough for me.

Cheers

PyNEwbie
+3  A: 
headerRows[0][10].contents[0].find('b').string