I need help parsing out some text from a page with lxml. I tried BeautifulSoup, but the HTML of the page I'm parsing is so broken it wouldn't work. So I've moved on to lxml, but the docs are a little confusing and I was hoping someone here could help me.

Here is the page I am trying to parse: http://bit.ly/bf1T12. I need to get the text under the "Additional Info" section. Note that I have a lot of pages like this on this site to parse, and each page's HTML is not always exactly the same (it might contain some extra empty "td" tags). Any suggestions as to how to get at that text would be very much appreciated.

Thanks for the help.

A: 

I'd prefer regex to get that section; you don't need to parse the HTML as long as you have a static pattern. You should search for the text between

<strong>Additional  Info</strong></td><td valign="top">&nbsp;</td><td valign="top" align="left">

and

</td>
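
A minimal sketch of that approach (hypothetical, not from the original answer; it assumes those two marker strings appear verbatim in the fetched page source):

import re
import urllib2

html = urllib2.urlopen('http://bit.ly/bf1T12').read()

# Hypothetical sketch: non-greedy capture of everything between the two
# static markers, with DOTALL so the blurb can span multiple lines.
pattern = re.compile(
    r'<strong>Additional  Info</strong></td>'
    r'<td valign="top">&nbsp;</td><td valign="top" align="left">'
    r'(.*?)</td>',
    re.DOTALL)

match = pattern.search(html)
if match:
    print(match.group(1))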

These links will be helpful.

http://docs.python.org/library/re.html

http://diveintopython.org/regular_expressions/index.html

huseyinalb
Regex for HTML isn't a good idea. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
Beau Martínez
Agreed, I would prefer to stick with lxml or beautifulsoup if I could get it to work.
bababa
HTML parsing is too slow a way to get information out of an HTML document. The link you gave is about matching general HTML tags, which should be found by an HTML parser. There are lots of bad ways people write those tags: `<a>`, `< a>`, `<a >`, `<a / >`, and so on. But this is a specific page that has a static pattern. I wouldn't use a sword to prepare a salad.
huseyinalb
`re` over hundreds of files is also slow, and less likely to consistently work.
Tim McNamara
+2  A: 
import lxml.html as lh
import urllib2

def text_tail(node):
    # lxml stores an element's inner text and the text that follows
    # its closing tag separately; yield both.
    yield node.text
    yield node.tail

url = 'http://bit.ly/bf1T12'
doc = lh.parse(urllib2.urlopen(url))
for elt in doc.iter('td'):
    text = elt.text_content()
    # Note the two spaces: the page source has 'Additional  Info'.
    if text.startswith('Additional  Info'):
        # Collect every text/tail string from the following sibling td
        # cells, skipping empty strings and non-breaking spaces.
        blurb = [text for node in elt.itersiblings('td')
                 for subnode in node.iter()
                 for text in text_tail(subnode) if text and text != u'\xa0']
        break
print('\n'.join(blurb))

yields

For over 65 years, Carl Stirn's Marine has been setting new standards of excellence and service for boating enjoyment. Because we offer quality merchandise, caring, conscientious, sales and service, we have been able to make our customers our good friends.

Our 26,000 sq. ft. facility includes a complete parts and accessories department, full service department (Merc. Premier dealer with 2 full time Mercruiser Master Tech's), and new, used, and brokerage sales.

Edit: Here is an alternate solution based on Steven D. Majewski's XPath, which addresses the OP's comment that the number of tags separating 'Additional Info' from the blurb can be unknown:

import lxml.html as lh
import urllib2

url = 'http://bit.ly/bf1T12'
doc = lh.parse(urllib2.urlopen(url))

# Select the text of every td that follows (as a sibling) the td whose
# child element's text is 'Additional  Info' (two spaces, as in the source).
blurb = doc.xpath('//td[child::*[text()="Additional  Info"]]/following-sibling::td/text()')

# Drop cells that contain only a non-breaking space.
blurb = [text for text in blurb if text != u'\xa0']
print('\n'.join(blurb))
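
In plain terms: the XPath matches any td with a child element whose text is exactly "Additional  Info", then returns the text content of every td that follows it at the same level, so the number of intervening empty cells no longer matters.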
unutbu
You could also use: doc.xpath('(//td[descendant-or-self::*[text()="Additional Info"]])[last()]/following-sibling::td[2]/text()') instead of iterating. (That XPath may be more complicated than necessary; I'm trying to avoid depending on the <strong> or other tags, the same as your iterating code.)
Steven D. Majewski
The only problem with this I could see is the getnext().getnext() part, because the HTML might not always be exactly the same on each page. One page might have only one td tag in between the text, the next page might have two. Is there any way to iterate over the tr tags until you find the one with the Additional Info text, then strip out the HTML so you're left with just the text? Only a thought.
bababa
@Steven: Thank you for the XPath. I'm still trying to learn XPath, and your example is very instructive for me.
unutbu
Wow, thanks for that. XPath is pretty interesting stuff. Appreciate the help from both of you.
bababa