views: 702

answers: 4

Hi guys. I'm writing a web crawler (web spider) that crawls all the links in a website. My application is a Win32 app, written in C# with .NET Framework 3.5. Right now I'm using HttpWebRequest and HttpWebResponse to communicate with the web server. I also built my own HTTP parser that can parse anything I want, and it finds all the links like "href", "src", "action"... But I cannot solve one problem: simulating client script in the page (like JS and VBS). For example, a link like:

<a href="javascript:buildLink(1)">

... where buildLink(parameter) is a JavaScript function that builds a custom link based on the parameter.

Please help me solve this problem. How can I simulate JavaScript in this app? I can parse the HTML source code and pull all the JavaScript code out into another file, but how can I execute one of its functions? Thanks.
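For context, the fetching side is basically just HttpWebRequest/HttpWebResponse, something like this simplified sketch (the user agent string is a placeholder):

    using System.IO;
    using System.Net;
    using System.Text;

    class PageFetcher
    {
        // Downloads the raw HTML that the parser then scans for href/src/action links.
        public static string Fetch(string url)
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
            request.UserAgent = "MyCrawler/1.0"; // placeholder user agent

            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            using (StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
            {
                return reader.ReadToEnd();
            }
        }
    }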

+1  A: 

This is a problem which is not easily solved. You could consider taking one of the existing JavaScript implementations and porting or interfacing with it somehow.

If I were tackling this problem, I'd probably build a small side application in Java on top of Rhino, with some sort of RPC framework layered on top of that so that I could communicate with it from my primary application.

Unfortunately, without a complete DOM implementation on top of that, you would be limited to only very simple JavaScript.
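On the C# side, the RPC piece could be as simple as posting the script text to the side application over HTTP and reading the result back. This is only an untested sketch; the port, the /eval path and the plain-text protocol are all invented for illustration, not an existing API:

    using System.IO;
    using System.Net;
    using System.Text;

    class RhinoClient
    {
        // Hypothetical protocol: a Java/Rhino side app listens on localhost:8642 and
        // evaluates whatever JavaScript is POSTed to /eval, returning the result as text.
        public static string Evaluate(string javascript)
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://localhost:8642/eval");
            request.Method = "POST";

            byte[] body = Encoding.UTF8.GetBytes(javascript);
            request.ContentLength = body.Length;
            using (Stream stream = request.GetRequestStream())
            {
                stream.Write(body, 0, body.Length);
            }

            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            {
                return reader.ReadToEnd();
            }
        }
    }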

Nick
+2  A: 

You are basically pretending to be a browser, except that HttpWebRequest only does the networking stuff for you.

I would recommend using the IE WebBrowser control and interop'ing with it from your C# application. That will allow you to run JavaScript, set variables, post forms, etc.

Here are some basic links I found after a search for "IE web browser control":

http://www.c-sharpcorner.com/UploadFile/mahesh/WebBrowserInCSMDB12022005001524AM/WebBrowserInCSMDB.aspx
http://support.microsoft.com/kb/313068
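Here is a rough, untested sketch with the WinForms WebBrowser control (which wraps IE): load the page, wait for it to finish, then call one of the page's own functions and read the links the browser ends up with. The URL is a placeholder and the buildLink call is just the example from the question:

    using System;
    using System.Windows.Forms;

    class BrowserCrawler
    {
        [STAThread]
        static void Main()
        {
            WebBrowser browser = new WebBrowser();
            browser.ScriptErrorsSuppressed = true;

            browser.DocumentCompleted += delegate
            {
                // Call a function the page itself defines, e.g. buildLink from the question.
                object result = browser.Document.InvokeScript("buildLink", new object[] { 1 });
                Console.WriteLine("buildLink returned: " + result);

                // Enumerate the links as the browser sees them after scripts have run.
                foreach (HtmlElement link in browser.Document.Links)
                {
                    Console.WriteLine(link.GetAttribute("href"));
                }

                Application.ExitThread();
            };

            browser.Navigate("http://example.com/"); // placeholder URL
            Application.Run(); // message loop so the control can actually load the page
        }
    }

Note that the control needs an STA thread with a message loop, which is why the sketch calls Application.Run().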

will
+1  A: 

You could execute the javascript by using the MS JScript engine or something similar.

This isn't guaranteed to work, especially if the javascript tries to access the DOM, or somesuch... But for simple scripts, it might be enough.
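For example, one way to get at a script engine from C# is COM interop with the Windows Script Control (add a COM reference to "Microsoft Script Control 1.0"). This is an untested sketch; the buildLink function below just stands in for code you extracted from a page, and anything that touches the DOM will fail, as noted above:

    using System;
    using MSScriptControl; // COM interop: reference "Microsoft Script Control 1.0"

    class JScriptRunner
    {
        [STAThread]
        static void Main()
        {
            ScriptControl engine = new ScriptControlClass();
            engine.Language = "JScript";

            // Stand-in for script code extracted from a crawled page.
            engine.AddCode("function buildLink(id) { return '/page.aspx?id=' + id; }");

            object link = engine.Eval("buildLink(1)");
            Console.WriteLine(link); // prints /page.aspx?id=1
        }
    }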

Stobor
+1  A: 

Your only real option is to automate a browser. As other answers have said, you cannot reliably simulate browser javascript without having a complete DOM.

Fortunately, there are ways to automate the browser; check out Selenium.

It has a C# API, so you can control the browser from C#.

Use your .NET web crawler code to crawl the site. Whenever you encounter an href="javascript:... link, handle the page containing the link in Selenium (see the sketch after the list below):

  1. Use the Selenium API to tell the browser to load the page.
  2. Use the Selenium API to find all links on the page.
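Roughly, those two steps could look like this with the Selenium WebDriver C# bindings (a sketch only; FirefoxDriver and the URL are placeholders, and the older Selenium RC client API differs in the details):

    using System;
    using System.Collections.ObjectModel;
    using OpenQA.Selenium;
    using OpenQA.Selenium.Firefox;

    class SeleniumLinkExtractor
    {
        static void Main()
        {
            IWebDriver driver = new FirefoxDriver(); // placeholder: any supported browser driver works
            try
            {
                // 1. Tell the browser to load the page.
                driver.Navigate().GoToUrl("http://example.com/page-with-js-links"); // placeholder URL

                // 2. Find all links on the page, after the browser has run the page's scripts.
                ReadOnlyCollection<IWebElement> links = driver.FindElements(By.TagName("a"));
                foreach (IWebElement link in links)
                {
                    Console.WriteLine(link.GetAttribute("href"));
                }
            }
            finally
            {
                driver.Quit();
            }
        }
    }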

This way, your spider only uses Selenium when necessary (pages without javascript links can be handled by the browser-less spider code you already have). And since this is an embarrassingly parallel workload, you could easily have multiple Selenium processes running at the same time (either on one computer or spread across several).

But remember that href="javascript is hardly the only way a page can have dynamic links. The more common case is probably that an onload or $(document).ready() script manipulates the DOM and adds links that way.

To catch that case (and others), the spider probably will have to use Selenium for all pages that have a <script> tag.

codeape