Disclaimer here: I'm really not a programmer. I'm eager to learn, but my experience is pretty much BASIC on a C64 twenty years ago and a couple of days of learning Python.

I'm just starting out on a fairly large (for me as a beginner) screen-scraping project. So far I have been using Python with mechanize + lxml for my browsing/parsing. Now I'm encountering some really JavaScript-heavy pages that don't show anything without JavaScript enabled, which means trouble for mechanize.

From my searching I've come to the conclusion that I basically have a few options:

  1. Trying to figure out what the JavaScript is doing and emulating it in my code (I don't quite know where to start with this. ;-))

  2. Using pywin32 to control Internet Explorer, or something similar, like using the WebKit browser from PyQt4, or even using Telnet and MozRepl (this seems really hard)

  3. Switching languages to Perl, since WWW::Mechanize seems to be a lot more mature there (add-ons and such for JavaScript). I don't know much about this at all.

If anyone has some pointers here, that would be great. I understand that I need to do a lot of trial and error, but it would be nice if I didn't stray too far from the "true" answer, if there is such a thing.
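For option 1, the usual trick is to open the browser's network tools, find the request the page's JavaScript makes, and call that endpoint directly with mechanize/urllib; the payload is often JSON, which is much easier to parse than rendered HTML. A minimal sketch, where the response body and the field names are made up for illustration:

```python
import json

# Hypothetical response body captured from the browser's network tab;
# the real endpoint URL and field names will differ per site.
raw = '{"items": [{"id": 1, "title": "First post"}, {"id": 2, "title": "Second"}]}'

# Parse the JSON and pull out the fields you would otherwise scrape.
data = json.loads(raw)
titles = [item["title"] for item in data["items"]]
print(titles)  # ['First post', 'Second']
```

If the endpoint returns JSON like this, you may not need a JavaScript-capable browser at all.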

A: 

A fourth option might be to use browserjs.

This is supposed to be a way to run a browser environment in Mozilla Rhino or some other command-line javascript engine. Presumably you could (at least in theory) load the page in that environment and dump the HTML after JS has had its way with it.

I haven't really used it myself; I tried it a couple of times but found it far too slow for my purposes. I didn't try very hard, though, so there might be an option you need to set.

intuited
A: 

I use Chickenfoot for simple tasks and python-webkit for more complex ones. I've had good experiences with both.

Here is a snippet to render a webpage (including executing any JavaScript) and return the resulting HTML:

import sys
from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage

class Render(QWebPage):
  def __init__(self, url):
    self.app = QApplication(sys.argv)
    QWebPage.__init__(self)
    self.loadFinished.connect(self._loadFinished)
    self.mainFrame().load(QUrl(url))
    self.app.exec_()  # block until _loadFinished fires

  def _loadFinished(self, result):
    # toHtml() returns a QString; convert it to a plain Python string
    # before passing it to other libraries such as lxml.
    self.html = str(self.mainFrame().toHtml())
    self.app.quit()

html = Render(url).html
Plumo
This looks really interesting. One problem I ran into is that I get an object of type QString. If I try to pass this to lxml I run into problems, since it has no idea what that is. How do I convert a QString into a unicode string? The Chickenfoot thing was also really cool; I've written tons of scripts in just an hour.
Yeah, QStrings are annoying when integrating with other libraries. Fortunately you can easily convert with str(qstring_variable).
Plumo
A: 

You might be able to find the data you are looking for elsewhere. Try using the Web Developer toolbar in Firefox to see what is being loaded by JavaScript. You might find the data in the JS files themselves.

Otherwise, you probably do need to use Mechanize. There are some tutorials that you might find useful here:

http://scraperwiki.com/help/tutorials/
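If the data turns out to be embedded in the page's own script blocks rather than fetched separately, a regular expression can often pull the JSON straight out of the raw HTML that mechanize already returns. A rough sketch with a made-up inline script and variable name:

```python
import json
import re

# Hypothetical page source; the variable name and structure
# will differ on a real site.
page = '<script>var initialData = {"count": 3, "names": ["a", "b", "c"]};</script>'

# Grab the object literal assigned to the variable and parse it as JSON.
match = re.search(r'var initialData = (\{.*?\});?</script>', page)
data = json.loads(match.group(1))
print(data["count"])  # 3
```

This only works when the embedded literal is valid JSON; if the site builds the object with JavaScript expressions, you are back to needing a real JS engine.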

ScraperWiki
A: 

For non-programmers, I recommend using IRobotSoft. It is visually oriented and has full JavaScript support. The shortcoming is that it runs only on Windows; the good thing is that you can become proficient with the software just through trial and error.

seagulf