ansaurus

Question

Answer 1

A:

The title is inside an <nyt_headline> tag, which is nested inside an <h1> tag and a <div> tag with id "article".

soup.findAll('nyt_headline', limit=1)

should work. It returns a list holding at most one element, so take index 0 to get the tag itself.

The article body is inside an <nyt_text> tag, which is nested inside a <div> tag with id "articleBody". Inside the <nyt_text> element, the text itself is contained within <p> tags. Images are not within those <p> tags. It's difficult for me to experiment with the syntax, but I expect a working scrape to look something like this.

text = soup.findAll('nyt_text', limit=1)[0]
text.findAll('p')
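For anyone on Python 3, here is a minimal sketch of the same idea using the newer `bs4` package instead of BeautifulSoup 3. The `SAMPLE` markup and the `extract_article` helper are made up for illustration; they only mirror the nesting described above and are not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mirrors the structure described above;
# it is NOT the actual NYT page.
SAMPLE = """
<div id="article">
  <h1><nyt_headline>Storm Hits the East Coast</nyt_headline></h1>
  <div id="articleBody">
    <nyt_text>
      <p>First paragraph.</p>
      <img src="storm.jpg">
      <p>Second paragraph.</p>
    </nyt_text>
  </div>
</div>
"""

def extract_article(html):
    """Return (title, list of paragraph strings) for the layout above."""
    soup = BeautifulSoup(html, "html.parser")
    headline = soup.find("nyt_headline")  # first match, or None
    body = soup.find("nyt_text")
    title = headline.get_text(strip=True) if headline else None
    # Only <p> tags carry article text; images sit outside them.
    paragraphs = [p.get_text(strip=True) for p in body.find_all("p")] if body else []
    return title, paragraphs

title, paragraphs = extract_article(SAMPLE)
```

`find` returns the first match directly, which avoids the `findAll(..., limit=1)[0]` indexing. This still only works for pages that use these site-specific tags.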

Ewan Todd 2009-12-20 18:40:54

I'm sure this works for this test case, but I'm looking for a more generic answer that can be applied to various other websites. So far I've tried using regexps to find <script></script> tags and comments and replace them with "", but even that is proving difficult for some reason.

2009-12-20 18:48:56

Answer 2

+7 A:

Try this:

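import re
import urllib

import BeautifulSoup  # BeautifulSoup 3, on Python 2

html = urllib.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read()
soup = BeautifulSoup.BeautifulSoup(html)
texts = soup.findAll(text=True)  # every text node, including comments

def visible(element):
    # Skip text inside elements that browsers do not render.
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    # Skip HTML comments; re.DOTALL lets '.' span multi-line comments.
    elif re.match('<!--.*-->', str(element), re.DOTALL):
        return False
    return True

visible_texts = filter(visible, texts)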
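If you are on Python 3 with the `bs4` package, a rough equivalent might look like the sketch below. It is an adaptation, not the code above: it swaps the regex for an `isinstance` check against `bs4.element.Comment`, and the names `HIDDEN_PARENTS`, `is_visible`, `visible_text` and the `SAMPLE` markup are made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Parents whose text a browser does not render (same list as above, as a set).
HIDDEN_PARENTS = {"style", "script", "head", "title", "[document]"}

def is_visible(node):
    """True if a text node would normally be shown to the reader."""
    if isinstance(node, Comment):  # HTML comments, single- or multi-line
        return False
    return node.parent.name not in HIDDEN_PARENTS

def visible_text(html):
    soup = BeautifulSoup(html, "html.parser")
    strings = soup.find_all(string=True)  # all text nodes, comments included
    # Drop hidden nodes and whitespace-only strings.
    return [s.strip() for s in strings if is_visible(s) and s.strip()]

# Hypothetical test page, not fetched from any real site.
SAMPLE = """<html><head><title>Hidden title</title>
<style>p { color: red; }</style></head>
<body><!-- a
multi-line comment --><p>Hello, <b>world</b>.</p>
<script>var x = 1;</script></body></html>"""

result = visible_text(SAMPLE)
```

Like the original, this only filters by tag name, so text hidden through CSS (for example `display: none`) would still come through.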

jbochi 2009-12-31 00:06:12

Thanks, jbochi, this is a great solution.

2010-01-05 09:41:54

@developerjay you're welcome :-)

jbochi 2010-01-05 22:00:10


BeautifulSoup Grab Visible Webpage Text