Web crawler that can interpret javascript

+4 Q:

Hi, I want to write a web crawler that can interpret JavaScript. Basically, it's a program in Java or PHP that takes a URL as input and outputs the DOM tree, similar to what Firebug shows in its HTML window. The best example is Kayak.com, where 'view source' does not show the resulting DOM that the browser displays, but you can save the resulting HTML through Firebug.

How would I go about doing this? What tools exist that would help me?

+2 A:

You are more likely to have success in Java than in PHP. There is a pre-existing JavaScript interpreter for Java called Rhino. It's a reference implementation, and well-documented.

Rhino is used in lots of existing Java apps to provide JavaScript scripting ability within the app. I have also heard of it being used to assist with automated testing of JavaScript.

I also know that Java includes code that can parse and render HTML, though someone who knows more about Java than I do can probably advise more on that. I won't deny that it would be very difficult to achieve something like this; you'd essentially be re-implementing a lot of what a browser does.
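
To give a rough idea of the HTML parsing side, here is a minimal sketch using the Swing HTML parser that ships with the JDK (javax.swing.text.html). The class name StaticDomOutline and the method outline are made up for illustration. Note that this parser only reads static markup and does not execute any JavaScript, so on its own it would not produce the DOM you see in Firebug; a JavaScript engine would still have to be wired in on top of it.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: prints an indented outline of the tags in static HTML.
class StaticDomOutline {

    // Builds the indentation prefix for the given nesting depth.
    private static String indent(int depth) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            sb.append("  ");
        }
        return sb.toString();
    }

    // Parses the HTML and returns one indented line per start tag, simple tag, or text node.
    static List<String> outline(String html) throws IOException {
        final List<String> lines = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private int depth = 0;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                lines.add(indent(depth) + "<" + tag + ">");
                depth++;
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                depth = Math.max(0, depth - 1);
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // Tags without a separate end tag (e.g. <br>) do not change the depth.
                lines.add(indent(depth) + "<" + tag + ">");
            }

            @Override
            public void handleText(char[] data, int pos) {
                lines.add(indent(depth) + "\"" + new String(data) + "\"");
            }
        };
        // The third argument tells the parser to ignore any charset declaration in the markup.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return lines;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><div id=\"app\"><p>Hello</p></div></body></html>";
        for (String line : outline(html)) {
            System.out.println(line);
        }
    }
}
```

For a real crawler you would fetch the page from the URL first and pass its content to outline; the parser may also report tags it infers itself (such as head), so the outline is not guaranteed to match the source one-to-one.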

thomasrutter 2010-04-20 01:57:28

Hi thomasrutter, thank you for the pointer. As I understand it, Rhino is only a JavaScript engine, so I would probably need to build a prototype browser around it to crawl a JavaScript-heavy page. Please correct me if I am wrong.

2010-04-20 03:28:21

Java also includes HTML parsing/rendering abilities. Someone who knows more about Java than me might be able to advise better with that - my knowledge ends here.

thomasrutter 2010-04-20 04:11:18

+1 A:

I've been using HtmlUnit (Java). It was originally designed for unit testing pages. Its JavaScript support isn't perfect, but it hasn't failed me in my limited usage. According to the site, it can run the following JS frameworks to a reasonable degree:

jQuery 1.2.6
MochiKit 1.4.1
GWT 2.0.0
Sarissa 0.9.9.3
MooTools 1.2.1
Prototype 1.6.0
Ext JS 2.2
Dojo 1.0.2
YUI 2.3.0

Jeff 2010-04-20 05:41:21

You could use Mozilla's rendering engine Gecko:

https://developer.mozilla.org/en/Gecko

RoToRa 2010-04-21 08:53:08

Google Chrome's V8 might also be an option here: http://code.google.com/p/v8/

phoenix24 2010-04-28 06:03:32
