Disclaimer here: I'm really not a programmer. I'm eager to learn, but my experience is pretty much BASIC on a C64 twenty years ago and a couple of days of learning Python.

I'm just starting out on a fairly large (for me as a beginner) screen-scraping project. So far I have been using Python with mechanize + lxml for my browsing/parsing. Now I'm encountering some really JavaScript-heavy pages that don't show anything without JavaScript enabled, which means trouble for mechanize.

From my searching I've come to the conclusion that I basically have a few options:

  1. Trying to figure out what the JavaScript is doing and emulating it in my code (I don't quite know where to start with this. ;-))

  2. Using pywin32 to control Internet Explorer, or something similar, like using the WebKit browser from PyQt4, or even using Telnet and MozRepl (this seems really hard)

  3. Switching languages to Perl, since WWW::Mechanize seems to be a lot more mature there (add-ons and such for JavaScript). I don't know much about this at all.

If anyone has some pointers here, that would be great. I understand that I need to do a lot of trial and error, but it would be nice if I didn't stray too far from the "true" answer, if there is such a thing.
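For option 1, the usual trick is to open the browser's network tools, find the request the page's JavaScript makes, and call that endpoint directly with mechanize/urllib; the payload is often JSON, which is much easier to parse than rendered HTML. A minimal sketch, where the response body and the field names are made up for illustration:

```python
import json

# Hypothetical response body captured from the browser's network tab;
# the real endpoint URL and field names will differ per site.
raw = '{"items": [{"id": 1, "title": "First post"}, {"id": 2, "title": "Second"}]}'

# Parse the JSON and pull out the fields you would otherwise scrape.
data = json.loads(raw)
titles = [item["title"] for item in data["items"]]
print(titles)  # ['First post', 'Second']
```

If the endpoint returns JSON like this, you may not need a JavaScript-capable browser at all.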

A: 

A fourth option might be to use browserjs.

This is supposed to be a way to run a browser environment in Mozilla Rhino or some other command-line javascript engine. Presumably you could (at least in theory) load the page in that environment and dump the HTML after JS has had its way with it.

I haven't really used it myself; I tried it a couple of times but found it far too slow for my purposes. I didn't try very hard, though, so there might be an option you need to set.

intuited
A: 

I use Chickenfoot for simple tasks and python-webkit for more complex ones. I've had good experiences with both.

Here is a snippet to render a webpage (including executing any JavaScript) and return the resulting HTML:

import sys
from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage

class Render(QWebPage):
  def __init__(self, url):
    self.app = QApplication(sys.argv)
    QWebPage.__init__(self)
    self.loadFinished.connect(self._loadFinished)
    self.mainFrame().load(QUrl(url))
    self.app.exec_()  # block until _loadFinished fires

  def _loadFinished(self, result):
    # toHtml() returns a QString; convert it to a plain Python string
    # before passing it to other libraries such as lxml.
    self.html = str(self.mainFrame().toHtml())
    self.app.quit()

html = Render(url).html
Plumo
This looks really interesting. One problem I ran into is that I get an object of type QString. If I try to pass this to lxml I run into problems, since it has no idea what that is. How do I convert a QString into a unicode string? The Chickenfoot thing was also really cool; I've written tons of scripts in just an hour.
Yeah, QStrings are annoying when integrating with other libraries. Fortunately you can easily convert with str(qstring_variable).
Plumo
A: 

You might be able to find the data you are looking for elsewhere. Try using the Web Developer toolbar in Firefox to see what is being loaded by JavaScript. You might find the data in the JS files themselves.

Otherwise, you probably do need to use Mechanize. There are some tutorials that you might find useful here:

http://scraperwiki.com/help/tutorials/
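If the data turns out to be embedded in the page's own script blocks rather than fetched separately, a regular expression can often pull the JSON straight out of the raw HTML that mechanize already returns. A rough sketch with a made-up inline script and variable name:

```python
import json
import re

# Hypothetical page source; the variable name and structure
# will differ on a real site.
page = '<script>var initialData = {"count": 3, "names": ["a", "b", "c"]};</script>'

# Grab the object literal assigned to the variable and parse it as JSON.
match = re.search(r'var initialData = (\{.*?\});?</script>', page)
data = json.loads(match.group(1))
print(data["count"])  # 3
```

This only works when the embedded literal is valid JSON; if the site builds the object with JavaScript expressions, you are back to needing a real JS engine.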

ScraperWiki
A: 

For non-programmers, I recommend using IRobotSoft. It is visually oriented and has full JavaScript support. The shortcoming is that it runs only on Windows; the good thing is that you can become proficient with the software just through trial and error.

seagulf