Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case: http://www.nytimes.com/2009/12/21/us/21storm.html. I mainly want to get the body text (the article) and maybe a few tab names here and there. However, after trying the suggestion at http://stackoverflow.com/questions/1752662/beautifulsoup-easy-way-to-to-obtain-html-free-contents, I get back lots of tags and HTML comments that aren't needed. I can't figure out the right arguments to findAll (http://www.crummy.com/software/BeautifulSoup/documentation.html#arg-limit) to do what I need.

So, how should I find all visible text, excluding scripts, comments, CSS, and other junk?

A: 

The title is inside an <nyt_headline> tag, which is nested inside an <h1> tag and a <div> tag with id "article".

soup.findAll('nyt_headline', limit=1)

Should work.

The article body is inside an <nyt_text> tag, which is nested inside a <div> tag with id "articleBody". Inside the <nyt_text> element, the text itself is contained within <p> tags. Images are not within those <p> tags. It's difficult for me to experiment with the syntax, but I expect a working scrape to look something like this.

text = soup.findAll('nyt_text', limit=1)[0]
text.findAll('p')
Ewan Todd
I'm sure this works for this test case; however, I'm looking for a more generic answer that can be applied to various other websites. So far, I've tried using regexps to find <script></script> tags and <!-- .* --> comments and replace them with "", but even that is proving kind of difficult for some reason.
+7  A: 

Try this:

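import re
import urllib

import BeautifulSoup  # BeautifulSoup 3, Python 2

html = urllib.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read()
soup = BeautifulSoup.BeautifulSoup(html)
texts = soup.findAll(text=True)  # every text node, including comments

def visible(element):
    # Skip text that lives inside tags the browser does not render.
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    # Skip HTML comments; re.DOTALL lets '.' span multi-line comments.
    elif re.match('<!--.*-->', str(element), re.DOTALL):
        return False
    return True

visible_texts = filter(visible, texts)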
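
For anyone on Python 3, the same filter can probably be written against the newer bs4 package along the lines below. This is only a sketch under a few assumptions: bs4 is installed, the stdlib "html.parser" backend is good enough for the page, and the names is_visible, visible_text, HIDDEN_PARENTS and the sample string are made up for illustration. It checks for comments with isinstance instead of a regex, which avoids depending on how a comment is rendered as a string.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical helper names; the tag list mirrors the filter above.
HIDDEN_PARENTS = {"style", "script", "head", "title", "[document]"}


def is_visible(element):
    """Return True if a text node would normally be shown on the page."""
    if element.parent.name in HIDDEN_PARENTS:
        return False
    if isinstance(element, Comment):
        return False
    return True


def visible_text(html):
    """Join the visible text nodes of an HTML string with single spaces."""
    soup = BeautifulSoup(html, "html.parser")
    texts = soup.find_all(string=True)  # all text nodes, comments included
    chunks = (t.strip() for t in texts if is_visible(t))
    return " ".join(chunk for chunk in chunks if chunk)


# Made-up sample document, so the sketch runs without network access.
sample = (
    "<html><head><title>Storm</title><style>p {color: red}</style></head>"
    "<body><!-- hidden note --><p>Snow fell overnight.</p>"
    "<script>var x = 1;</script></body></html>"
)
print(visible_text(sample))
```

Note that this only drops text by tag and node type; text hidden through CSS (display: none and the like) would still come through.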
jbochi
thanks jbochi, this is a great solution.
@developerjay you're welcome :-)
jbochi