ansaurus

Question

Answer 1

+1 A:

Have you tried the scRUBYt! library? I'm not 100% sure, but I think I used it to scrape some dynamic web sites.

It has some useful methods like

click_link_and_wait 'Get results', 5

Yaraher 2009-09-15 12:56:21

Answer 2

+1 A:

Win32::IE::Mechanize

David Dorward 2009-09-15 13:00:54

Answer 3

A:

At the end of the day, websites that do not use Flash or other embedded plugins have to make HTTP requests from the browser to the server. Most, if not all, of those requests will follow patterns in their URIs. Use Firebug or LiveHTTPHeaders to capture all the requests, which in turn lets you see what data comes back. From there, you can build ways to grab the data you want.

That is, of course, assuming they are not using some crappy form of obfuscation/encryption to slow you down.
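As a rough sketch of the replay idea, assuming Firebug showed a GET request that returns JSON: the snippet below rebuilds the URL from its query parameters and fetches it with Python's standard library. The endpoint, the parameter names and the `X-Requested-With` header are placeholders, not anything taken from a real site.

```python
import json
import urllib.parse
import urllib.request


def build_request_url(base_url, params):
    """Rebuild a captured request URL from its base and query parameters."""
    return base_url + "?" + urllib.parse.urlencode(params)


def fetch_json(url, opener=urllib.request.urlopen):
    """Replay a captured request and decode the JSON body it returns."""
    # Assumption: the endpoint answers with UTF-8 encoded JSON.
    # Some sites only answer requests that look like XHR calls, so the
    # header below mimics one; drop it if the capture does not show it.
    request = urllib.request.Request(
        url, headers={"X-Requested-With": "XMLHttpRequest"}
    )
    with opener(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

You would call it along the lines of `fetch_json(build_request_url("http://example.invalid/search", {"q": "foo", "page": 2}))`, swapping in whatever URL and parameters the capture actually shows. Authentication cookies, if the site needs them, are not handled here.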

squeeks 2009-09-15 13:04:52

Aye, I have been doing this, but it is akin to reverse engineering the site and providing the calls. There are authentication issues with this approach and the API is pretty horrible. I'd rather not have to understand it all... that's probably me just being lazy though! :)

Quibblesome 2009-09-15 13:07:01

Answer 4

+2 A:

I use HtmlUnit, generally wrapped in a Java-based scripting language like JRuby. HtmlUnit is fantastic because its JavaScript engine handles all of the dynamic functionality, including AJAX, behind the scenes. That makes it very easy to scrape.

Rob Di Marco 2009-09-15 13:41:00

Answer 5

+1 A:

You could use Watij if you're into Java (and want to automate Internet Explorer). Alternatively, you can use WebDriver, which can also automate Firefox. WebDriver has a Python API too.
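For the Python route, here is a minimal sketch, assuming the third-party `selenium` package and a working Firefox driver are installed. The helper names `wait_for_text` and `scrape_with_firefox` are made up for illustration. The waiting loop is hand-rolled so it works with any driver-like object; Selenium also ships `WebDriverWait` for the same job.

```python
import time


def wait_for_text(driver, url, css_selector, timeout=10.0, poll=0.5):
    """Load url, then poll until an element matching css_selector exists."""
    driver.get(url)
    deadline = time.monotonic() + timeout
    while True:
        # "css selector" is the locator strategy string behind By.CSS_SELECTOR.
        elements = driver.find_elements("css selector", css_selector)
        if elements:
            return elements[0].text
        if time.monotonic() >= deadline:
            raise TimeoutError("no element matched " + css_selector)
        time.sleep(poll)


def scrape_with_firefox(url, css_selector):
    """Run wait_for_text in a real Firefox session (needs selenium installed)."""
    from selenium import webdriver  # imported lazily; third-party package

    driver = webdriver.Firefox()
    try:
        return wait_for_text(driver, url, css_selector)
    finally:
        # Always close the browser, even if the wait timed out.
        driver.quit()
```

Because the browser runs the page's JavaScript itself, the text returned is what the page shows after its AJAX calls finish, provided the selector only matches once the data has arrived.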

Geo 2009-09-15 13:45:04

Webscraping a javascript based website
