Page scraping on the Internet seems to have hit somewhat of a wall for me, as more and more sites depend on JavaScript for rendering portions of the screen.

It seems to me that with so many open-source layout engines and JavaScript engines released (like WebKit, Gecko, and Chromium + V8), someone must have made a tool for downloading a page and rendering its JavaScript without having to run an actual browser. However, I'm not turning up what I'm looking for with my searches; I've found tools like Selenium RC, but they depend on a running browser. I'm interested in any tool or library that can do one (or both) of the following:

  1. A program that can be run from the command line (*nix) which, given the source of a page, returns the page's source as rendered by some JS engine.

  2. Integrated support in a particular language that allows one to (easily) pass the source of a page to it and get back the page's source as rendered by some JS engine.

I think #1 is preferable in a general sense, but #2 would be more useful if the tool exists in the language I want to work in. Also, I'm not concerned with the particular JS engine - any relatively modern one will do. What is out there?

A: 

There is the Cobra Engine for Java (http://lobobrowser.org/cobra.jsp), which handles JavaScript (it also has a renderer, but that is optional). I've never used it, but I have heard nice things about it.
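
I haven't used it either, but going by the parser example in Cobra's documentation, driving it would look roughly like this. Treat it as an unverified sketch: the class names are from memory, and the example.com URI is just a placeholder.

    import java.io.Reader;
    import java.io.StringReader;
    import org.lobobrowser.html.UserAgentContext;
    import org.lobobrowser.html.parser.DocumentBuilderImpl;
    import org.lobobrowser.html.parser.InputSourceImpl;
    import org.lobobrowser.html.test.SimpleUserAgentContext;
    import org.w3c.dom.Document;

    public class CobraSketch {
        public static void main(String[] args) throws Exception {
            String html = "<html><body><script>document.writeln('hi');</script></body></html>";
            // Base URI used to resolve relative links; placeholder here
            String uri = "http://example.com/";
            UserAgentContext context = new SimpleUserAgentContext();
            DocumentBuilderImpl builder = new DocumentBuilderImpl(context);
            Reader reader = new StringReader(html);
            // Parsing builds Cobra's DOM and executes embedded scripts against it
            Document document = builder.parse(new InputSourceImpl(reader, uri));
            System.out.println(document.getDocumentElement().getTextContent());
        }
    }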

David
+1  A: 

You can look at HtmlUnit. Its main purpose is automated web testing, but I think it may let you get the rendered page.
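
For example, something along these lines fetches a page, runs its JavaScript against HtmlUnit's DOM, and prints the resulting markup (an untested sketch; the URL is a placeholder):

    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class RenderPage {
        public static void main(String[] args) throws Exception {
            WebClient webClient = new WebClient();
            webClient.setJavaScriptEnabled(true); // on by default; shown for clarity
            // getPage downloads the page and executes its scripts
            HtmlPage page = webClient.getPage("http://example.com/");
            // asXml serializes the document as it stands after script execution
            System.out.println(page.asXml());
        }
    }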

Sergey
+1  A: 

We used Rhino some time ago to do some automated testing from Java. It seems it'll do the job for you :)
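
The basic embedding pattern is only a few lines. Keep in mind that Rhino is just the JS engine; it has no DOM of its own, so for rendering pages you'd pair it with something that supplies one (HtmlUnit, mentioned above, embeds Rhino for exactly that). A minimal sketch:

    import org.mozilla.javascript.Context;
    import org.mozilla.javascript.Scriptable;

    public class RhinoSketch {
        public static void main(String[] args) {
            Context cx = Context.enter();
            try {
                // A scope with the standard ECMAScript objects, but no browser DOM
                Scriptable scope = cx.initStandardObjects();
                Object result = cx.evaluateString(scope, "var x = 6 * 7; x;", "<inline>", 1, null);
                System.out.println(Context.toString(result)); // prints 42
            } finally {
                Context.exit();
            }
        }
    }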

Seb
A: 

Do you mean, how can I steal your javascript, as well as your content? Don't forget the css.

kennebec
And I would've gotten away with it too, if it weren't for you meddling kids!
Daniel Lew
Yeah, javascript is REALLY well hidden via http you know. It's not a GET request away or anything.
Alex Fort
+2  A: 

Well, there's the DumpRenderTree tool, which is used as part of the WebKit test suites. I'm not sure how suitable it is for turning into a standalone tool, but it does what you ask for (render HTML, run JavaScript, and dump its render tree out to disk).

Brian Campbell
A: 

I think there's example code for Qt that uses the included WebKit to render a page to a pixmap; from there to a full CLI utility is just a matter of defining your needs.

Of course, for most screen-scraping needs you want the text, not a pixmap... if that's what you want, better check Rhino.

Javier
A: 

It takes very little code to have a WebView render a page without displaying anything, but it has to be a GUI application. GUI applications can take command-line arguments as well, though, and can hide their window. Using WebKit directly, it might be possible to do this in a pure command-line tool.

Apart from the (somewhat complicated) DOM access from Objective-C, WebKit can also inject JavaScript, and together with jQuery that makes for a nice scraping solution. I don't know of any universal application that does that, though.

Tobias
A: 

Since JavaScript can do quite a lot of manipulation to a web page's document object model (DOM), it seems that to accurately scrape the content of an arbitrary page, you'd need not only to run a JavaScript engine but also to have a complete and accurate DOM representation of the page. That's something you'll only get if you have a real browser engine instantiated. It is possible to use an embedded, non-displayed WebKit or Gecko engine for this; then, after a suitable loading delay to allow for script execution, just dump the DOM contents in HTML form.
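
For what it's worth, HtmlUnit (mentioned above) implements exactly this load-wait-dump pattern, just with an embedded Rhino engine rather than WebKit or Gecko. A rough, untested sketch, with the URL and the 10-second delay as placeholder assumptions:

    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class LoadWaitDump {
        public static void main(String[] args) throws Exception {
            WebClient webClient = new WebClient();
            HtmlPage page = webClient.getPage("http://example.com/");
            // The "suitable loading delay": wait up to 10s for timers/XHR to finish
            webClient.waitForBackgroundJavaScript(10000);
            // Dump the DOM contents, in HTML form, as they stand after the scripts ran
            System.out.println(page.asXml());
        }
    }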

Ben Combee
That's exactly what I want; sorry if I didn't explain it correctly in my post. I understand that you need both a DOM and a JS engine to get what I want. If you could explain the last part in more detail, that'd be appreciated.
Daniel Lew