views: 1264
answers: 5

I need to do some HTML parsing with Python. After some research, lxml seems to be my best choice, but I am having a hard time finding examples that help me with what I am trying to do, which is why I am here. I need to scrape a page for all of its viewable text: strip out all tags and JavaScript, and leave me with only the text that is actually visible. Sounds simple enough. I did it with HTMLParser, but it's not handling JavaScript well:

import HTMLParser
import cStringIO

class HTML2Text(HTMLParser.HTMLParser):
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.output = cStringIO.StringIO()

    def get_text(self):
        return self.output.getvalue()

    def handle_data(self, data):
        # called for every text node, including the contents of <script> blocks
        self.output.write(data)

def ParseHTML(source):
    p = HTML2Text()
    p.feed(source)
    text = p.get_text()
    return text

Any ideas for a way to do this with lxml, or a better way to do it with HTMLParser? HTMLParser would be best because no additional libs are needed. Thanks, everyone.
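
Roughly, what I am hoping for from lxml is something along these lines (completely untested on my end; lxml.html.clean's Cleaner looks like the relevant helper, and the exact options and the function name here are just my guess from the docs):

import lxml.html
from lxml.html.clean import Cleaner

def lxml_visible_text(source):
    # guess: Cleaner with these flags should drop <script>/<style> elements entirely
    cleaner = Cleaner(scripts=True, javascript=True, style=True)
    tree = lxml.html.fromstring(source)
    cleaned = cleaner.clean_html(tree)
    # text_content() concatenates whatever text nodes are left
    return cleaned.text_content()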

Scott F.

+3  A: 

No screen-scraping library I know "does well with Javascript" -- it's just too hard to anticipate all ways in which JS could alter the HTML DOM dynamically, conditionally &c.

Alex Martelli
A: 

BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/) is often the right answer to python html scraping questions.
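
For the "viewable text only" case asked about here, a minimal sketch might look like the following (using the newer bs4 package name; when this answer was written it would have been the BeautifulSoup 3 module, whose API differs slightly, and the helper name is just illustrative):

from bs4 import BeautifulSoup

def visible_text(html):
    soup = BeautifulSoup(html, "html.parser")
    # drop <script> and <style> elements so their contents don't leak into the text
    for tag in soup(["script", "style"]):
        tag.extract()
    # collapse whatever text nodes remain, separated by single spaces
    return soup.get_text(separator=" ", strip=True)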

Paul Hankin
A: 

I know of no Python HTML parsing library that handles running the JavaScript in the page being parsed. It's not "simple enough", for the reasons given by Alex Martelli and more.

For this task you may need to think about going to a higher level than just parsing HTML and look at web application testing frameworks.

Two that can execute JavaScript and are either Python-based or can interface with Python:

Unfortunately I'm not sure if the "unit testing" orientation of these frameworks will actually let you scrape out visible text.

So the only other solution would be to do it yourself, say by integrating python-spidermonkey into your app.

Van Gale
A: 

Your code is smart and quite flexible already, I think.

How about simply adding handle_starttag() and handle_endtag() to suppress the <script> blocks?

import HTMLParser
import cStringIO

class HTML2Text(HTMLParser.HTMLParser):
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.output = cStringIO.StringIO()
        self.is_in_script = False

    def get_text(self):
        return self.output.getvalue()

    def handle_data(self, data):
        # only keep text that is not inside a <script> block
        if not self.is_in_script:
            self.output.write(data)

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.is_in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.is_in_script = False

def ParseHTML(source):
    p = HTML2Text()
    p.feed(source)
    text = p.get_text()
    return text
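
For example, fed a small made-up fragment that mixes markup with an inline script, it keeps only the visible text:

sample = """<html><head><script>var x = 1;</script></head>
<body><h1>Title</h1> <p>Hello world</p></body></html>"""
print ParseHTML(sample)  # roughly "Title Hello world"; whitespace between tags is preserved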
A: 

scrape.py can do this for you.

It's as simple as:

import scrape
s = scrape.Session()
s.go('yoursite.com')
print s.doc.text

Jump to about 2:40 in this video for an awesome overview from the creator of scrape.py: pycon.blip.tv/file/3261277

PPC-Coder