Hello everyone, I have a web crawler application. It has successfully crawled most common, simple sites. Now I have run into websites whose HTML documents are generated dynamically through forms or JavaScript. I believe they can be crawled; I just don't know how. These websites do not expose the actual HTML of the page: when I browse such a page in IE or Firefox, the HTML source does not match what the browser actually displays. These sites contain textboxes, checkboxes, etc., so I believe they are what is called "Web Forms". I am not very familiar with web development, so correct me if I'm wrong.

My question is: has anyone been in a similar situation and successfully solved these kinds of "challenges"? Does anyone know of a book or article on web crawling, particularly one that covers these more advanced types of websites?

Thanks.

+1  A: 

There are two separate issues here.

Forms

As a rule of thumb, crawlers do not touch forms.

It might be appropriate to write something for a specific website that submits predetermined (or semi-random) data (particularly when writing automated tests for your own web applications), but generic crawlers should leave forms well alone.

The spec describing how to submit form data is available at http://www.w3.org/TR/html4/interact/forms.html#h-17.13. There may be a library for C# that will help.
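
As a rough illustration of what that section of the spec describes, here is a minimal sketch (in Python, standard library only, rather than C#) that collects the named inputs of the first form on a page and encodes them as application/x-www-form-urlencoded data. The sample HTML, the field names, and the FormCollector and build_request names are all made up for illustration. It only handles input elements, not select or textarea, so treat it as a starting point rather than a complete implementation.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class FormCollector(HTMLParser):
    """Collects the action, method, and input fields of the first form seen."""

    def __init__(self):
        super().__init__()
        self.action = ""
        self.method = "get"
        self.fields = []       # (name, value) pairs in document order
        self._in_form = False
        self._done = False     # ignore any form after the first one

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = a.get("action") or ""
            self.method = (a.get("method") or "get").lower()
        elif tag == "input" and self._in_form and a.get("name"):
            kind = (a.get("type") or "text").lower()
            if kind in ("submit", "button", "reset", "image", "file"):
                return  # not sent as ordinary data in this sketch
            if kind in ("checkbox", "radio"):
                if "checked" not in a:
                    return  # unchecked boxes are not submitted
                self.fields.append((a["name"], a.get("value") or "on"))
            else:
                self.fields.append((a["name"], a.get("value") or ""))

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def build_request(page_url, html):
    """Return (method, url, body) for submitting the form with its defaults."""
    parser = FormCollector()
    parser.feed(html)
    target = urljoin(page_url, parser.action)
    body = urlencode(parser.fields)
    if parser.method == "post":
        return "POST", target, body
    # For GET, the encoded data goes into the query string instead.
    return "GET", target + "?" + body, None


# Hypothetical page content, used only to exercise the sketch.
SAMPLE = """
<form action="/search" method="post">
  <input type="text" name="q" value="deep web">
  <input type="checkbox" name="exact" checked>
  <input type="checkbox" name="images">
  <input type="hidden" name="page" value="1">
  <input type="submit" value="Go">
</form>
"""
METHOD, TARGET, BODY = build_request("http://example.com/index.html", SAMPLE)
print(METHOD, TARGET, BODY)
```

Submitting only the default values keeps the crawler from inventing data, which fits the advice above about not poking at forms blindly.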

JavaScript

JavaScript is a rather complicated beast.

There are three common ways you can deal with it:

  1. Write your crawler so it duplicates the JS functionality of specific websites that you care about.
  2. Automate a web browser.
  3. Use something like Rhino with env.js.
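
To give an idea of the first option, here is a minimal sketch in Python. It assumes a hypothetical site whose scripts navigate by assigning a string literal to location or location.href, and it recovers those targets with a regular expression instead of running the JavaScript. The pattern, the sample markup, and the script_links name are assumptions for illustration. Real sites vary widely, which is exactly why this approach only works per site.

```python
import re
from urllib.parse import urljoin

# Matches assignments such as location.href='/a' or window.location = "/b".
# This pattern is an assumption about one hypothetical site, not a general rule.
NAV_PATTERN = re.compile(
    r"""(?:window\.|document\.)?location(?:\.href)?\s*=\s*(['"])(.+?)\1"""
)


def script_links(page_url, html):
    """Return absolute URLs that the page's scripts would navigate to."""
    found = []
    for match in NAV_PATTERN.finditer(html):
        url = urljoin(page_url, match.group(2))
        if url not in found:  # keep first occurrence, preserve order
            found.append(url)
    return found


# Hypothetical page content, used only to exercise the sketch.
SAMPLE = """
<button onclick="location.href='/products?page=2'">Next</button>
<script>
  function goHome() { window.location = "index.html"; }
</script>
"""
LINKS = script_links("http://example.com/shop/list.html", SAMPLE)
print(LINKS)
```

Anything the scripts compute at run time (concatenated strings, data fetched later) is invisible to a pattern like this, and that is where options 2 and 3 come in.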
David Dorward
Hi David, thanks for the info. It's a good start. You mentioned a generic crawler; that is actually what I am designing, and I am trying to make it as generic as possible. I'm looking for a good book or any other resource about web crawling, but I can't find one. Do you know of any? Thanks again.
Jojo
A: 

I found an article that tackles the deep web. It's very interesting, and I think it answers my questions above.

http://trycatchfail.com/blog/post/2008/11/10/Creating-a-deep-web-crawler-with-NET-Background.aspx

Gotta love this.

Jojo