I need to build a small "monitoring" scraper for a third-party website (an external site that holds stats about our visitors).

Unfortunately, this website is very hard to scrape with plain wget, because it relies on a lot of sophisticated JavaScript, part of it generated by GWT. My workaround was to write a GreaseMonkey script that calls a PHP page, which logs the scraped data. As soon as Firefox starts with the page to scrape, the script goes to work.

This works well, but now I am trying to make it more robust as a monitoring tool. I want it to run on the server from a cron job. As far as I understand, Firefox needs the DISPLAY variable to be set and an X session to exist (it currently refuses to run for me). Is there any nice way to run it from the batchuser account as a cron job?

+1  A: 

I've done something similar to get Selenium running headless on a server. I used Xvfb.

http://en.wikipedia.org/wiki/Xvfb

This article has some tips for using Xvfb with Firefox:

http://semicomplete.com/blog/geekery/xvfb-firefox.html
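
For what it's worth, here is a rough sketch of how the Xvfb approach might be wrapped in a script that cron can call. It is only an illustration, not the setup from the article: it assumes `Xvfb` and `firefox` are on the PATH of the cron user, and the display number `:99`, the screen size, the two-second startup wait, and the function names are all made up for the example.

```python
import os
import shutil
import subprocess
import time


def xvfb_command(display=":99", screen="1024x768x24"):
    """Build the argv for a virtual X server on the given display."""
    return ["Xvfb", display, "-screen", "0", screen]


def firefox_env(display=":99", base_env=None):
    """Copy the environment and point DISPLAY at the virtual server."""
    env = dict(os.environ if base_env is None else base_env)
    env["DISPLAY"] = display
    return env


def run_scrape(url, display=":99", run_seconds=120):
    """Start Xvfb, open the page in Firefox, then shut both down."""
    for tool in ("Xvfb", "firefox"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    xvfb = subprocess.Popen(xvfb_command(display))
    try:
        time.sleep(2)  # crude wait for the X server to come up (assumed to be enough)
        browser = subprocess.Popen(["firefox", url], env=firefox_env(display))
        try:
            # Give the GreaseMonkey script time to do its work, then stop.
            browser.wait(timeout=run_seconds)
        except subprocess.TimeoutExpired:
            browser.terminate()
            browser.wait()
    finally:
        xvfb.terminate()
        xvfb.wait()
```

The cron entry would then run a small script that calls `run_scrape(...)` with the address of the stats page. Many distributions also ship an `xvfb-run` wrapper that handles starting and stopping the virtual display, which may be simpler if it is available.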

Nate
Perfect! Exactly what I'm looking for. Small typo: times -> tips?
Artem
Will leave this up for a little longer to see any other alternative solutions.
Artem
A: 

The best way to do that is to build Firefox in headless mode: http://hg.mozilla.org/incubator/offscreen

Paul Rouget