I need help parsing out some text from a page with lxml. I tried BeautifulSoup, but the HTML of the page I'm parsing is so broken it wouldn't work. So I've moved on to lxml, but the docs are a little confusing and I was hoping someone here could help me.

Here is the page I am trying to parse: http://bit.ly/bf1T12. I need to get the text under the "Additional Info" section. Note that I have a lot of pages like this on this site to parse, and each page's HTML is not always exactly the same (it might contain some extra empty "td" tags). Any suggestions as to how to get at that text would be very much appreciated.

Thanks for the help.

A: 

I'd prefer regex to get that section; you don't need to parse the HTML as long as you have a static pattern. You should search for the text between

<strong>Additional  Info</strong></td><td valign="top">&nbsp;</td><td valign="top" align="left">

and

</td>
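
A minimal sketch of that approach (hypothetical, not from the original answer; it assumes those two marker strings appear verbatim in the fetched page source):

import re
import urllib2

html = urllib2.urlopen('http://bit.ly/bf1T12').read()

# Hypothetical sketch: non-greedy capture of everything between the two
# static markers, with DOTALL so the blurb can span multiple lines.
pattern = re.compile(
    r'<strong>Additional  Info</strong></td>'
    r'<td valign="top">&nbsp;</td><td valign="top" align="left">'
    r'(.*?)</td>',
    re.DOTALL)

match = pattern.search(html)
if match:
    print(match.group(1))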

These links will be helpful.

http://docs.python.org/library/re.html

http://diveintopython.org/regular_expressions/index.html

huseyinalb
Regex for HTML isn't a good idea. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
Beau Martínez
Agreed, I would prefer to stick with lxml or beautifulsoup if I could get it to work.
bababa
HTML parsing is too slow a way to get information out of an HTML document. The link you gave is about matching general HTML tags, which should be found by an HTML parser. There are lots of bad ways people write those tags: `<a>`, `< a>`, `<a >`, `<a / >`, and so on. But this is a specific page that has a static pattern. I wouldn't use a sword to prepare a salad.
huseyinalb
`re` over hundreds of files is also slow, and less likely to consistently work.
Tim McNamara
+2  A: 
import lxml.html as lh
import urllib2

def text_tail(node):
    # lxml stores an element's inner text and the text that follows
    # its closing tag separately; yield both.
    yield node.text
    yield node.tail

url = 'http://bit.ly/bf1T12'
doc = lh.parse(urllib2.urlopen(url))
for elt in doc.iter('td'):
    text = elt.text_content()
    # Note the two spaces: the page source has 'Additional  Info'.
    if text.startswith('Additional  Info'):
        # Collect every text/tail string from the following sibling td
        # cells, skipping empty strings and non-breaking spaces.
        blurb = [text for node in elt.itersiblings('td')
                 for subnode in node.iter()
                 for text in text_tail(subnode) if text and text != u'\xa0']
        break
print('\n'.join(blurb))

yields

For over 65 years, Carl Stirn's Marine has been setting new standards of excellence and service for boating enjoyment. Because we offer quality merchandise, caring, conscientious, sales and service, we have been able to make our customers our good friends.

Our 26,000 sq. ft. facility includes a complete parts and accessories department, full service department (Merc. Premier dealer with 2 full time Mercruiser Master Tech's), and new, used, and brokerage sales.

Edit: Here is an alternate solution based on Steven D. Majewski's XPath, which addresses the OP's comment that the number of tags separating 'Additional Info' from the blurb can be unknown:

import lxml.html as lh
import urllib2

url = 'http://bit.ly/bf1T12'
doc = lh.parse(urllib2.urlopen(url))

# Select the text of every td that follows (as a sibling) the td whose
# child element's text is 'Additional  Info' (two spaces, as in the source).
blurb = doc.xpath('//td[child::*[text()="Additional  Info"]]/following-sibling::td/text()')

# Drop cells that contain only a non-breaking space.
blurb = [text for text in blurb if text != u'\xa0']
print('\n'.join(blurb))
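
In plain terms: the XPath matches any td with a child element whose text is exactly "Additional  Info", then returns the text content of every td that follows it at the same level, so the number of intervening empty cells no longer matters.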
unutbu
You could also use: doc.xpath('(//td[descendant-or-self::*[text()="Additional Info"]])[last()]/following-sibling::td[2]/text()') instead of iterating. (That XPath may be more complicated than necessary; I'm trying to avoid depending on the <strong> or other tags, the same as your iterating code.)
Steven D. Majewski
The only problem with this I could see is the getnext().getnext() part, because the HTML might not always be exactly the same on each page. One page might have only one td tag in between the text, the next page might have two. Is there any way to iterate over the tr tags until you find the one with the Additional Info text, then strip out the HTML so you're left with just the text? Only a thought.
bababa
@Steven: Thank you for the XPath. I'm still trying to learn XPath, and your example is very instructive for me.
unutbu
Wow, thanks for that. XPath is pretty interesting stuff. Appreciate the help from both of you.
bababa