I want to crawl useful resources (like background pictures) from certain websites. That is not a hard job, especially with the help of wonderful projects like Scrapy.
The problem is that I don't want to crawl these sites just ONE TIME. I want to keep the crawl running long term and pick up updated resources. So I'd like to know: is there any good strategy for a web crawler to get updated pages?
Here's a coarse algorithm I've thought of. I divide the crawl process into rounds: in each round, the URL repository gives the crawler a certain number (say, 10,000) of URLs to crawl, and then the next round begins. The detailed steps are below, with a rough code sketch after the list:
1. The crawler adds the start URLs to the URL repository.
2. The crawler asks the URL repository for at most N URLs to crawl.
3. The crawler fetches those URLs and updates certain information in the URL repository, such as the page content, the fetch time, and whether the content has changed.
4. Go back to step 2.
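To make the loop concrete, here is a minimal sketch of those four steps. The `UrlRepository` class, its methods, and the batch-selection policy are all my own assumptions for illustration, not an existing API (a real implementation would probably live behind Scrapy's scheduler instead):

```python
import time
import hashlib
import requests

N = 10000  # max URLs handed out per round (the number from the description above)

class UrlRepository:
    """Hypothetical store tracking, per URL, the metadata from step 3:
    a content hash, the last fetch time, and whether the content changed."""

    def __init__(self, start_urls):
        self.records = {url: {"hash": None, "fetched_at": None, "changed": False}
                        for url in start_urls}

    def next_batch(self, n):
        # Naive policy: hand out the least recently fetched URLs first.
        urls = sorted(self.records, key=lambda u: self.records[u]["fetched_at"] or 0)
        return urls[:n]

    def update(self, url, content):
        digest = hashlib.sha256(content).hexdigest()
        rec = self.records[url]
        # Content "changed" only if we had a previous hash and it differs.
        rec["changed"] = rec["hash"] is not None and rec["hash"] != digest
        rec["hash"] = digest
        rec["fetched_at"] = time.time()

def crawl_forever(start_urls):
    repo = UrlRepository(start_urls)      # step 1: seed the repository
    while True:                           # each iteration is one round
        for url in repo.next_batch(N):    # step 2: ask for at most N URLs
            resp = requests.get(url, timeout=10)
            repo.update(url, resp.content)  # step 3: record content/time/changed
        # step 4: loop back and ask for the next batch
```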
To make that work, I still need to solve the following question: how to estimate the "freshness" of a web page, i.e. the probability that the page has been updated since it was last fetched?
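One idea I've been considering (just one heuristic, not a definitive answer) is to model each page's updates as a Poisson process: estimate a per-page change rate from the fetch history the repository already records, then score the probability of a change since the last fetch as 1 - e^(-rate * elapsed). A rough sketch, where the `fetches` list of `(fetch_time, changed)` tuples is a hypothetical record format:

```python
import math
import time

def estimated_change_rate(fetches):
    """Estimate changes per second from history.
    fetches: list of (fetch_time, changed) tuples, oldest first."""
    if len(fetches) < 2:
        return 0.0
    observed_span = fetches[-1][0] - fetches[0][0]
    changes = sum(1 for _, changed in fetches if changed)
    return changes / observed_span if observed_span > 0 else 0.0

def probability_updated(fetches, now=None):
    """Probability the page changed since its last fetch,
    assuming updates arrive as a Poisson process."""
    now = now if now is not None else time.time()
    rate = estimated_change_rate(fetches)
    elapsed = now - fetches[-1][0]
    return 1.0 - math.exp(-rate * elapsed)
```

Pages with a high score could then be prioritized when the repository picks the next round's batch.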
Since that is an open question, hopefully it will bring some fruitful discussion here.