I want to crawl useful resources (such as background pictures) from certain websites. That is not a hard job in itself, especially with the help of wonderful projects like Scrapy.

The problem is that I don't want to crawl a site just ONE TIME. I want to keep the crawler running long-term and pick up updated resources. So I want to know: is there a good strategy for a web crawler to find updated pages?

Here's a coarse algorithm I've thought of. I divide the crawl process into rounds. In each round, the URL repository gives the crawler a certain number of URLs (say, 10000) to crawl, and then the next round begins. The detailed steps are:

  1. The crawler adds the start URLs to the URL repository.
  2. The crawler asks the URL repository for at most N URLs to crawl.
  3. The crawler fetches the URLs and updates certain information in the URL repository, such as the page content, the fetch time, and whether the content has changed.
  4. Go back to step 2.
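
To make the rounds concrete, here is a minimal sketch of the loop in Python. All names (`UrlRecord`, `UrlRepository`, `run_round`) are hypothetical, the repository lives in memory, and `fetch` is whatever function downloads a URL and returns bytes. Ordering the batch by "never fetched first, then oldest fetch time" is only a placeholder policy that I assumed for illustration; a real freshness score would go there.

```python
import hashlib
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class UrlRecord:
    url: str
    last_fetch: Optional[float] = None   # Unix time of the last fetch
    content_hash: Optional[str] = None   # SHA-256 of the last fetched body
    changed: bool = False                # did the last fetch differ from the one before?


class UrlRepository:
    """Hypothetical in-memory URL repository."""

    def __init__(self) -> None:
        self.records: Dict[str, UrlRecord] = {}

    def add(self, url: str) -> None:
        # Step 1: register a URL if it is not known yet.
        self.records.setdefault(url, UrlRecord(url))

    def next_batch(self, n: int) -> List[str]:
        # Step 2: hand out at most n URLs.
        # Placeholder policy: never-fetched URLs first, then the oldest fetches.
        ordered = sorted(
            self.records.values(),
            key=lambda r: (r.last_fetch is not None, r.last_fetch or 0.0),
        )
        return [r.url for r in ordered[:n]]

    def update(self, url: str, content: bytes, fetch_time: float) -> bool:
        # Step 3: store the fetch time and whether the content changed.
        record = self.records[url]
        digest = hashlib.sha256(content).hexdigest()
        record.changed = record.content_hash is not None and digest != record.content_hash
        record.content_hash = digest
        record.last_fetch = fetch_time
        return record.changed


def run_round(repo: UrlRepository, fetch: Callable[[str], bytes], n: int) -> List[str]:
    """Run one round and return the URLs whose content changed."""
    changed = []
    for url in repo.next_batch(n):
        if repo.update(url, fetch(url), time.time()):
            changed.append(url)
    return changed
```

Step 4 is then just calling `run_round` again, for example in a `while True:` loop with a sleep between rounds.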

To make this more specific, I still need to answer the following question: how do I determine the "refresh-ness" of a web page, i.e. the probability that the page has been updated?

Since that is an open question, I hope it will lead to some fruitful discussion here.

A: 

If the site to be crawled lists a sitemap in its robots.txt, that will provide a nice, low-resource way of checking URL freshness for the entire website.
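
As a rough illustration of the sitemap approach, the sketch below pulls `Sitemap:` lines out of a robots.txt body and compares each `<lastmod>` entry with the time you last fetched that URL. The function names and the `last_fetch` dict are made up for this example. It assumes a plain `<urlset>` sitemap (not a sitemap index) and that the site actually keeps `<lastmod>` up to date, which is not guaranteed.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_links(robots_txt):
    """Return the URLs named on 'Sitemap:' lines of a robots.txt body."""
    links = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap":
            links.append(value.strip())
    return links


def parse_lastmod(text):
    """Parse a <lastmod> value; dates without a timezone are treated as UTC."""
    parsed = datetime.fromisoformat(text.strip().replace("Z", "+00:00"))
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def urls_to_refetch(sitemap_xml, last_fetch):
    """Return sitemap URLs that look newer than our last fetch.

    last_fetch maps URL -> timezone-aware datetime of the previous fetch.
    URLs we have never fetched, or that carry no <lastmod>, are included.
    """
    stale = []
    root = ET.fromstring(sitemap_xml)
    for entry in root.findall("sm:url", SITEMAP_NS):
        loc = entry.findtext("sm:loc", namespaces=SITEMAP_NS)
        if loc is None:
            continue
        loc = loc.strip()
        lastmod = entry.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        fetched = last_fetch.get(loc)
        if fetched is None or lastmod is None or parse_lastmod(lastmod) > fetched:
            stale.append(loc)
    return stale
```

One request for the sitemap then tells you which pages are worth fetching again, instead of one request per page.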

Otherwise, it probably involves doing If-Modified-Since GETs and keeping track of change intervals. That doesn't work well on dynamic content, though.
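
A small sketch of both halves, using only the Python standard library. `conditional_get` sends the `Last-Modified` value from the previous response back as `If-Modified-Since` and treats a 304 as "unchanged". `RevisitSchedule` is a hypothetical helper, and its rule (halve the revisit interval when a page changed, double it when it did not) is just one simple heuristic I picked for illustration, with arbitrary default bounds.

```python
import urllib.error
import urllib.request
from dataclasses import dataclass


def conditional_get(url, last_modified=None, timeout=10.0):
    """Return (body, last_modified); body is None if the server answers 304."""
    request = urllib.request.Request(url)
    if last_modified:
        # Value of the Last-Modified header from the previous fetch.
        request.add_header("If-Modified-Since", last_modified)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read(), response.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: keep the old validator
            return None, last_modified
        raise


@dataclass
class RevisitSchedule:
    """Hypothetical per-URL revisit interval, in seconds."""
    interval: float = 86400.0            # assumed starting point: one day
    min_interval: float = 3600.0         # assumed lower bound: one hour
    max_interval: float = 30 * 86400.0   # assumed upper bound: thirty days

    def record(self, changed):
        # Changed pages get visited sooner, unchanged pages later.
        factor = 0.5 if changed else 2.0
        self.interval = min(self.max_interval,
                            max(self.min_interval, self.interval * factor))
        return self.interval
```

After each fetch you would call `record(body is not None)` for that URL and schedule the next visit `interval` seconds later. For dynamic pages that never send `Last-Modified` or never answer 304, you would have to compare content hashes instead.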