views:

181

answers:

5

Hi,

There are many tools that scrape HTML pages with JavaScript off. Are there any that will scrape with JavaScript on, including pressing buttons that trigger JavaScript callbacks?

I'm currently trying to scrape a site that is navigated solely through JavaScript calls. All the buttons that lead to the content execute JavaScript, without an href in sight. I could reverse engineer the JavaScript calls (which do, in part, return HTML), but that is going to take some time. Are there any shortcuts?

+1  A: 

Have you tried using scRUBYt!? I'm not 100% sure, but I think I used it to scrape some dynamic web sites.

It has some useful methods like

click_link_and_wait 'Get results', 5
Yaraher
+1  A: 

Win32::IE::Mechanize

David Dorward
A: 

At the end of the day, websites that do not use Flash or other embedded plugins still need to make HTTP requests from the browser to the server. Most, if not all, of those requests will have patterns in their URIs. Use Firebug or LiveHTTPHeaders to capture all the requests, which in turn lets you see what data comes back. From there, you can build ways to grab the data you want.

That is, of course, assuming they are not using some crappy form of obfuscation or encryption to slow you down.
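
Replaying a captured request could look roughly like the sketch below. It is only an illustration using the Python standard library: the endpoint `BASE_URL`, the query parameters, the session cookie, and the assumption that the response is an HTML fragment containing `<td>` cells are all hypothetical stand-ins for whatever Firebug shows on the real site.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint; replace with the URI pattern seen in Firebug.
BASE_URL = "http://example.com/ajax/results"


def build_request(base_url, params, cookie=None):
    """Rebuild a captured GET request, optionally reusing a session cookie."""
    url = base_url + "?" + urlencode(params)
    # Many AJAX endpoints check this header; whether yours does is an assumption.
    headers = {"X-Requested-With": "XMLHttpRequest"}
    if cookie:
        headers["Cookie"] = cookie
    return Request(url, headers=headers)


class CellCollector(HTMLParser):
    """Collect the text of every <td> in a returned HTML fragment."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


def extract_cells(fragment):
    """Parse an HTML fragment and return the stripped text of its <td> cells."""
    parser = CellCollector()
    parser.feed(fragment)
    parser.close()
    return [cell.strip() for cell in parser.cells]


def fetch_cells(params, cookie=None):
    """Replay the request over the network and parse the response."""
    with urlopen(build_request(BASE_URL, params, cookie)) as resp:
        return extract_cells(resp.read().decode("utf-8", errors="replace"))
```

Something like `fetch_cells({"page": 2}, cookie="sid=...")` would then stand in for one button press, provided the cookie comes from a session that is already logged in.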

squeeks
Aye, I have been doing this but it is akin to reverse engineering the site and providing the calls. There are authentication issues with this approach and the API is pretty horrible. I'd rather not have to understand it all.... that's probably me just being lazy though! :)
Quibblesome
+2  A: 

I use HtmlUnit, generally wrapped in a Java-based scripting language like JRuby. HtmlUnit is fantastic because its JavaScript engine handles all of the dynamic functionality, including AJAX, behind the scenes. That makes it very easy to scrape.

Rob Di Marco
+1  A: 

You could use Watij if you're into Java (and want to automate Internet Explorer). Alternatively, you can use WebDriver, which can also automate Firefox. WebDriver has a Python API too.
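
As a rough idea of what the Python side might look like, here is a small sketch of a click-then-wait helper. It is not taken from any particular project: the helper name `click_and_wait` and both CSS selectors are made up for illustration, and it assumes a driver object exposing `find_element(by, value)` and `find_elements(by, value)` the way current Selenium WebDriver releases do.

```python
import time

# The string behind Selenium's By.CSS_SELECTOR locator strategy.
CSS = "css selector"


def click_and_wait(driver, button_css, result_css, timeout=5.0, poll=0.25):
    """Click a JavaScript-driven button, then poll until results appear.

    Returns the text of every element matching result_css, or raises
    TimeoutError if nothing shows up within `timeout` seconds.
    """
    driver.find_element(CSS, button_css).click()
    deadline = time.monotonic() + timeout
    while True:
        found = driver.find_elements(CSS, result_css)
        if found:
            return [element.text for element in found]
        if time.monotonic() >= deadline:
            raise TimeoutError("no element matched %r" % result_css)
        time.sleep(poll)
```

With Selenium installed, you would create a driver with `webdriver.Firefox()`, load the page with `driver.get(url)`, and then call something like `click_and_wait(driver, "#search", ".result")`. Because the browser runs the site's own JavaScript, there is no need to understand the calls behind the button.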

Geo