views: 308

answers: 5

I want to download web pages that use JavaScript to output their data. Wget can do everything else, but it can't run JavaScript.

Even something like: firefox -remote "saveURL(www.mozilla.org, myfile.html)"

would be great (unfortunately that kind of command does not exist).

+4  A: 

I'd look at the Selenium browser automation tool (http://seleniumhq.org/) - you can automate visiting a web page and saving the resultant HTML.

We used it with great success for a similar purpose on a previous project.
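
For example, a minimal sketch using Selenium's Python bindings (the language, the crude wait, and the URL borrowed from the question are just assumptions for illustration):

    # Rough sketch: drive a real browser with Selenium and save the rendered page.
    # Assumes the Python bindings are installed and a Firefox driver is available.
    import time
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        driver.get("http://www.mozilla.org/")
        time.sleep(5)  # crude wait so the page's JavaScript can finish running
        with open("myfile.html", "w", encoding="utf-8") as f:
            f.write(driver.page_source)  # the DOM as modified by JavaScript
    finally:
        driver.quit()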

Chaos
A: 

If it can be a Windows-based app, you can try using the browser component of any programming language like C#, Visual Basic, Delphi, etc. to load the page, then peek into the content and save it. The browser component should be based on the IE rendering engine and should support JavaScript. There's a question regarding snapshots of websites here; it may be of some use to you.
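
As a rough illustration of the same idea, the IE engine can also be driven from a script through COM automation; the following Python/pywin32 sketch is my own substitution for the C#/VB/Delphi browser component the answer has in mind:

    # Rough sketch: drive the IE engine through COM (Windows only, needs pywin32).
    import time
    import win32com.client

    ie = win32com.client.Dispatch("InternetExplorer.Application")
    ie.Visible = False
    ie.Navigate("http://www.mozilla.org/")
    while ie.Busy or ie.ReadyState != 4:  # 4 = READYSTATE_COMPLETE
        time.sleep(0.5)
    html = ie.Document.documentElement.outerHTML  # markup after scripts have run
    with open("myfile.html", "w", encoding="utf-8") as f:
        f.write(html)
    ie.Quit()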

Alternatively, you could consider building your own Firefox extension. Take a peek here for further details (there's no "next" button, just the menu on the left for navigation, which confused me at first).

evilpenguin
+1  A: 

I second Alex's suggestion of Selenium. It runs in the browser, so it can capture the output HTML after JavaScript has modified the DOM.

Eric Wendelin
+1  A: 

The problem with using a browser-driven approach is that it'll be hard to automate the process of scraping.

Look for a "headless browser" in your programming language of choice. Alternatively, you can use Jaxer to load the DOM server-side, execute the JavaScript and let it manipulate the DOM, and then scrape the modified DOM using the same JavaScript you are already familiar with. This would be my preferred approach.
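
Setting Jaxer aside, the headless route might look like this (again a Python/Selenium sketch of my own, using Firefox's headless mode rather than anything this answer names):

    # Rough sketch: same idea without a visible browser window.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Firefox(options=options)
    try:
        driver.get("http://www.mozilla.org/")
        html = driver.page_source  # the DOM after the page's scripts have run
    finally:
        driver.quit()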

Rakesh Pai
A: 

I have done this before using:

Plumo