views: 154 · answers: 3

Is there some standard time duration that a crawler must wait between repeated hits to the same server, so as not to overburden the server?

If not, any suggestions on what would be a good waiting period for the crawler to be considered polite?

Does this value also vary from server to server, and if so, how can one determine it?

Any thoughts would be great...

Thanks

A: 

That will depend on how often the content changes. For example, it makes sense to crawl a news site more often than a site with static articles.

As to exactly how to determine the optimum - it will depend on how you judge the cost of fetching, indexing, etc. against the value of having up-to-date data. That's entirely up to you - but you will probably have to use some heuristics to work out how much the site is changing over time, based on observations. If a site hasn't changed for three fetches in a row, you might want to wait a little bit longer before fetching next time. Conversely, if a site always changes every time you fetch it, you might want to be a little bit more aggressive to avoid missing updates.

Jon Skeet
Curious, should the crawler try to respect the HTTP meta 'expires' tag?
dirkgently
You can use a logarithmic pattern to adjust the recrawl interval, based on the time between content changes.
Miguel Ping
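
A minimal Python sketch of the adaptive heuristic described in this answer: lengthen the wait when a page stays the same, shorten it when it keeps changing. The bounds, the doubling/halving factors, and the hash-based change check are illustrative assumptions rather than anything from the answer.

    import hashlib
    import time
    import urllib.request

    MIN_INTERVAL = 60.0          # politeness floor in seconds (assumed value)
    MAX_INTERVAL = 24 * 3600.0   # upper bound in seconds (assumed value)

    def content_hash(url):
        """Fetch the page and hash its body, as a simple change detector."""
        with urllib.request.urlopen(url) as resp:
            return hashlib.sha256(resp.read()).hexdigest()

    def crawl_adaptively(url, interval=3600.0):
        last_hash = None
        while True:
            current = content_hash(url)
            if current == last_hash:
                # Unchanged this fetch: wait longer before the next one.
                interval = min(interval * 2, MAX_INTERVAL)
            else:
                # Changed: revisit sooner, but never below the politeness floor.
                interval = max(interval / 2, MIN_INTERVAL)
            last_hash = current
            time.sleep(interval)
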
+2  A: 

This IBM article goes into some detail on how their Web crawler uses the robots exclusion protocol and recrawl interval settings.

To quote the article:

The first time that a page is crawled, the crawler uses the date and time that the page is crawled and an average of the specified minimum and maximum recrawl intervals to set a recrawl date. The page will not be recrawled before that date. The time that the page will be recrawled after that date depends on the crawler load and the balance of new and old URLs in the crawl space.

Each time that the page is recrawled, the crawler checks to see if the content has changed. If the content has changed, the next recrawl interval will be shorter than the previous one, but never shorter than the specified minimum recrawl interval. If the content has not changed, the next recrawl interval will be longer than the previous one, but never longer than the specified maximum recrawl interval.

This is about their web crawler, but it is a useful read while building your own tool.

Ólafur Waage
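
A Python sketch of the interval policy the quoted passage describes: the first recrawl interval is the average of the minimum and maximum, and each subsequent interval shrinks when content changed and grows when it did not, clamped to the configured bounds. The concrete values and the adjustment factor are assumptions for illustration; the article does not give them.

    from dataclasses import dataclass

    @dataclass
    class RecrawlPolicy:
        min_interval: float = 3600.0        # assumed minimum: 1 hour
        max_interval: float = 30 * 86400.0  # assumed maximum: 30 days
        factor: float = 1.5                 # assumed adjustment factor

        def initial_interval(self):
            # First crawl: an average of the minimum and maximum intervals.
            return (self.min_interval + self.max_interval) / 2

        def next_interval(self, previous, content_changed):
            if content_changed:
                # Content changed: shorter than before, never below the minimum.
                return max(previous / self.factor, self.min_interval)
            # Content unchanged: longer than before, never above the maximum.
            return min(previous * self.factor, self.max_interval)

    # Example use of the hypothetical policy object:
    policy = RecrawlPolicy()
    interval = policy.initial_interval()
    interval = policy.next_interval(interval, content_changed=True)
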
A: 

I don't think there is a minimum interval for how often you can hit a site, as it is highly dependent on the current server load and the server's capability.

You can test the response time and time-out rate: if a site responds slowly or gives you time-out errors, increase your re-hit interval, even though it might not be your crawler causing the slowness or time-outs.

FlyinFish
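
A rough Python sketch of that idea: measure the response time and stretch the per-site delay when the server is slow or timing out. The thresholds and multipliers here are illustrative guesses, not values from the answer.

    import time
    import urllib.error
    import urllib.request

    def polite_fetch(url, delay, slow_threshold=2.0, timeout=10.0):
        """Fetch a URL and return (body, adjusted delay before the next hit)."""
        start = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                body = resp.read()
        except (urllib.error.URLError, TimeoutError):
            # Time-out or connection error: back off sharply.
            return None, delay * 4
        elapsed = time.monotonic() - start
        if elapsed > slow_threshold:
            # Slow response: increase the re-hit interval.
            return body, delay * 2
        # Healthy response: relax the delay slowly toward a one-second floor.
        return body, max(delay * 0.9, 1.0)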