ansaurus

Question

Answer 1

+3 A:

No screen-scraping library I know "does well with Javascript" -- it's just too hard to anticipate all ways in which JS could alter the HTML DOM dynamically, conditionally &c.

Alex Martelli 2009-05-02 05:47:06

Answer 2

A:

BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/) is often the right answer to python html scraping questions.

Paul Hankin 2009-05-02 08:33:29

Answer 3

A:

I know of no Python HTML parsing libraries that handle running javascript in the page being parsed. It's not "simple enough" for the reasons given by Alex Martelli and more.

For this task you may need to think about going to a higher level than just parsing HTML and look at web application testing frameworks.

Two that can execute javascript and are either Python based or can interface with Python:

Unfortunately I'm not sure if the "unit testing" orientation of these frameworks will actually let you scrape out visible text.

So the only other solution would be to do it yourself, say by integrating python-spidermonkey into your app.

Van Gale 2009-05-02 09:40:04

Answer 4

A:

Your code is smart and very flexible to extent, I think.

How about simply adding handle_starttag() and handle_endtag() to supress the <script> blocks?

class HTML2Text(HTMLParser.HTMLParser):
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.output = cStringIO.StringIO()
        self.is_in_script = False
    def get_text(self):
        return self.output.getvalue()
    def handle_data(self, data):
        if not self.is_in_script:
            self.output.write(data)
    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.is_in_script = True
    def handle_endtag(self, tag):
        if tag == "script":
            self.is_in_script = False

def ParseHTML(source):
    p = HTML2Text()
    p.feed(source)
    text = p.get_text()
    return text

2009-09-05 14:47:08

Answer 5

A:

scape.py can do this for you.

It's as simple as:

import scrape
s = scrape.Session()
s.go('yoursite.com')
print s.doc.text

Jump to about 2:40 in this video for an awesome overview from the creator of scrape.py: pycon.blip.tv/file/3261277

PPC-Coder 2010-04-08 21:02:24

ansaurus

tags:

views:

answers:

Python lxml screen scraping?

related questions