I need to capture a website and am looking for an appropriate library or program to do this. The website uses JavaScript and pushes updates to the page, and I need to capture these updates as well as the page itself. I am using curl to capture the page itself, but I don't know how to capture the updates. Given a choice, I would use C++.

Regards

+1  A: 

Take a look at SpiderMonkey.

I've not actually used it in anger, so I'm unsure whether it will do what you want. I have come across it used optionally with Scrapy, the web-crawling and screen-scraping framework written in Python.

Alternatively, can you reverse-engineer how the JavaScript push updates are carried out and access them directly? It sounds like you'll need to store these updates and/or apply them to the base HTML page.

Mat
A: 

The problem is that your web pages are updating because script code is executing on the page. Using curl isn't going to get you there.

Not sure of your exact needs, but you could write a JavaScript injector bookmarklet that adds a button to any web page and lets you grab the DOM or body HTML manually whenever you want (a sketch follows below). This is how many of the clip-marking apps work.
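
A rough sketch of that idea - the prompt here is only a stand-in for whatever a real clipper would actually do with the captured HTML:

    javascript:(function () {
        // Grab the current, script-modified markup, not the original source.
        var html = document.body.innerHTML;
        // Stand-in output - a real clipper would POST this somewhere instead.
        window.prompt('Captured HTML (copy with Ctrl+C):', html);
    })();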

If you need something that automatically captures updates as they occur - like a movie - then you're going to need something more involved.

Scott Evernden
+2  A: 

If you still want to use C++ and curl, try to figure out what the JavaScript in the page does - I assume it just uses a timer to send an AJAX request and update the page (although it could be more complicated). Use a tool like Firefox with Firebug (the "Net" panel is what you want) to see what kind of request it is - you'll get:

  • the URL of the request
  • the parameters
  • the returned contents (it could be HTML, text, XML or JSON)

With a bit of luck you'll have enough to mimic the behavior in C++ with curl. If you can't make anything out of the gathered data, you'll have to browse through the JavaScript and try to figure out what it is doing (but most of the time page updates are really simple).
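
For example, the page's own update logic often boils down to a pattern like the following - the endpoint, parameter and element id are all invented, just to show the shape of what you're looking for when you browse the script:

    // Hypothetical page code: a timer fires an AJAX request and
    // patches the page with whatever comes back.
    var lastUpdate = 0;
    setInterval(function () {
        var xhr = new XMLHttpRequest();
        xhr.open('GET', '/updates?since=' + lastUpdate, true); // made-up endpoint
        xhr.onreadystatechange = function () {
            if (xhr.readyState === 4 && xhr.status === 200) {
                lastUpdate = new Date().getTime();
                // made-up element id
                document.getElementById('ticker').innerHTML = xhr.responseText;
            }
        };
        xhr.send(null);
    }, 5000);

Once Firebug's Net panel shows you the equivalent request on the real site, repeating that same GET or POST from C++ with curl is straightforward.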

The easy way to do this would be to do it inside a browser, e.g. as a Firefox plugin (written in JavaScript) - if this is needed for anything other than a pet project this might be a bit inelegant, but it should be really easy to do:

  • monitor the DOM tree for updates (HTML DOM Level 2 has all kinds of "mutation" events, but I've never used them, so I don't know much about them or whether they work/are supported - see DOM mutation events). There is even a possibility this kind of thing would work in Greasemonkey, which would mean you wouldn't have to make a full Firefox plugin - e.g. "Post-processing a page after it renders" should get you started (you don't want to track 'load', but something like "DOMSubtreeModified"). If the mutation events don't work, you can always use a timer and compare the HTML contents. See the sketch after this list.
  • or do as Firebug does and monitor the network requests, then do something with the results
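
A minimal sketch of the first option, assuming the DOM Level 2 mutation events are available (with the timer-based comparison as the fallback); captureSnapshot is a hypothetical hook for whatever storage you need:

    var last = document.body.innerHTML;

    function captureSnapshot() {
        // Hypothetical hook - store or upload the current markup here.
        last = document.body.innerHTML;
    }

    if (typeof MutationEvent !== 'undefined') {
        // DOM Level 2 mutation event; fires on any subtree change
        // (very chatty - a real script would debounce this).
        document.addEventListener('DOMSubtreeModified', captureSnapshot, false);
    } else {
        // Fallback: poll and compare the serialized HTML once a second.
        setInterval(function () {
            if (document.body.innerHTML !== last) {
                captureSnapshot();
            }
        }, 1000);
    }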
Hrvoje Prgeša
+2  A: 

Install Firefox and Greasemonkey. Have the GM script add DOM event listeners where appropriate to track modifications. You can then use XMLHttpRequest to send the information to a server, or write it to local files with XPCOM file I/O operations.
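
A minimal sketch of such a user script, assuming a hypothetical local collector at http://localhost:8080/capture (GM_xmlhttpRequest is used because, unlike a page-level XMLHttpRequest, it may POST across origins):

    // ==UserScript==
    // @name      capture-page-updates
    // @include   http://example.com/*
    // ==/UserScript==
    // The @include above is a placeholder - point it at the real site.

    function send(html) {
        GM_xmlhttpRequest({
            method: 'POST',
            url: 'http://localhost:8080/capture', // hypothetical collector
            headers: { 'Content-Type': 'text/html' },
            data: html
        });
    }

    // Initial snapshot, then a debounced one on every subtree change
    // (mutation events fire for every single node modification).
    send(document.body.innerHTML);
    var pending = null;
    document.addEventListener('DOMSubtreeModified', function () {
        if (pending) clearTimeout(pending);
        pending = setTimeout(function () {
            send(document.body.innerHTML);
        }, 500);
    }, false);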

With this, you can do what you want in a dozen lines and little to no reverse engineering, whereas what others have advised (screen scraping) will require thousands of lines of code for a JavaScript-heavy site, IMO.

Addendum: this is /not/ a job for C++. And should you do it in C++ anyway, you will end up having to reverse-engineer the JS, so you might as well just learn enough JS to use Greasemonkey in the first place.

niXar
+1  A: 

If you are looking for static web page scraping, BeautifulSoup (Python) is one of the best and easiest.

If you are looking to scrape JavaScript-rendered tickers or the like, that cannot be done until the page is rendered, hence it's not possible with BeautifulSoup alone. You will have to use a headless browser like Crowbar (from the SIMILE project; it uses XULRunner), which renders the JavaScript content on a headless browser, and the output of this rendered content can be used as input to the BeautifulSoup scraper.

JV