Hi Everyone,

Does anybody have any idea how to crawl websites that have dynamic pages/queries? I mean, if I click a certain link, it has different values every time I reload it in a web browser. Right now my web crawler cannot download the contents of these pages. Please advise.

A: 

You might want to look at this question which details how to write a crawler or look at the source code for http://searcharoo.net/ which contains a good crawler (see here).

Kane
Hi Kane, thanks for your reply. Searcharoo is interesting; however, if anyone out there can pinpoint how this (downloading pages from dynamic links) can be done, that would be a big help. Looking at the Searcharoo code, it might take me some time to understand its architecture.
Jojo
+1  A: 

It works the same way whether the page is dynamic or not. A crawler is really only a matter of three things:

  1. The URL
  2. The data sent to the server, if the request uses the POST method
  3. The cookie, if authentication is required

That's all.
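As a rough illustration of those three pieces, here is a minimal Python sketch using only the standard library. The helper names `build_request` and `fetch`, the example URL, and the cookie value are made up for illustration; a real crawler would also need error handling, redirects, and politeness delays.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_request(url, form_data=None, cookie=None):
    """Assemble the three pieces: URL, optional POST data, optional cookie."""
    # Form fields are URL-encoded into the request body.
    body = urlencode(form_data).encode("utf-8") if form_data else None
    # urllib sends a POST when a body is present and a GET otherwise.
    request = Request(url, data=body)
    if cookie:
        # Hypothetical session cookie copied from a logged-in browser session.
        request.add_header("Cookie", cookie)
    return request


def fetch(request, timeout=10):
    """Send the request and return the response body as text."""
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


if __name__ == "__main__":
    # example.com and the field names below are placeholders.
    req = build_request(
        "http://example.com/search",
        form_data={"q": "dynamic pages"},
        cookie="session=abc",
    )
    print(req.get_method(), req.full_url)
```

Calling `fetch(req)` would then download the page, in the same way for a "dynamic" URL as for a static one.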

Common problems when writing a crawler:

  1. Guessing the default page wrong [index.html, index.php, default.aspx, etc.]. Actually, the request will work without the page name for all methods [POST/GET].
  2. One of the field names is not written exactly as the form expects.
  3. The ASP.NET form ViewState ID field (I forgot the name), but this can be handled easily.
  4. Dynamic pages generated by JavaScript. This one is the hardest part, and in most cases even Google still has problems with it.
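For problems 2 and 3, one possible approach is to read the field names straight from the page instead of typing them by hand, and to send every hidden field back with the POST. The sketch below assumes the ViewState lives in a hidden input (in ASP.NET Web Forms it is normally named `__VIEWSTATE`); the class and function names are made up for illustration.

```python
from html.parser import HTMLParser


class HiddenFieldCollector(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <input ... />.
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


def hidden_fields(html):
    """Return a dict of all hidden form fields found in the HTML text."""
    collector = HiddenFieldCollector()
    collector.feed(html)
    collector.close()
    return collector.fields


if __name__ == "__main__":
    # A made-up page fragment, only for demonstration.
    page = (
        '<form method="post">'
        '<input type="hidden" name="__VIEWSTATE" value="abc123" />'
        '<input type="text" name="q">'
        "</form>"
    )
    print(hidden_fields(page))
```

The returned dict can be merged with your own form values before posting, so that the server receives the exact field names and the ViewState it issued. This does not help with problem 4: content built by JavaScript is not in the downloaded HTML at all.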

Hope that helps.

ktutnik