views: 253
answers: 2
I have a simple web crawler that requests all of the pages from a website's sitemap, which I need to cache and index. After several requests, the website begins serving blank pages.

There is nothing in their robots.txt except the link to their sitemap, so I assume I'm not breaking their "rules". I send a descriptive header that links to an explanation of exactly what my intentions are, and the only pages I crawl come from their sitemap.

The HTTP status codes are all still 200 OK, so I can only imagine they're throttling clients that make a large number of HTTP requests in a short period of time. What is considered a reasonable amount of delay between requests?

Are there any other considerations I've overlooked that could potentially cause this problem?

+4  A: 

Every site has different crawler and abuse characteristics it looks for.

The key for any crawler is to emulate human activity, and obey the robots.txt.

An exhaustive crawl will trip some websites, and they'll shut you down regardless of how slowly you go, whereas other hosts don't mind crawlers zipping along and sucking everything up in one go.

  • In general, don't request pages more quickly than about 6 per minute (roughly human speed).
  • You'll be safer following links in the order they're visible on the webpage.
  • Ignore links that are not visible on the webpage (many sites use hidden honeypot links to detect crawlers).

If all else fails, don't request more quickly than one page per minute. If a website blocks you at this rate, then contact them directly - they obviously don't want you to use their content in that way.
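
As a rough illustration of that pacing, here's a minimal Python sketch of a sitemap-driven crawl that waits 10 seconds between requests (about 6 per minute). The sitemap URL and User-Agent string are placeholders, and the fetch logic is deliberately bare; this is a sketch of the throttling idea, not a production crawler.

    import time
    import urllib.request
    import xml.etree.ElementTree as ET

    SITEMAP_URL = "https://example.com/sitemap.xml"   # placeholder sitemap location
    USER_AGENT = "MyCrawler/1.0 (+https://example.com/crawler-info)"  # placeholder descriptive UA
    DELAY_SECONDS = 10  # roughly 6 requests per minute

    def fetch(url):
        """Fetch a URL with a descriptive User-Agent header."""
        request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(request) as response:
            return response.read()

    def sitemap_urls(sitemap_xml):
        """Yield the <loc> entries from a standard sitemap.org urlset."""
        ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
        for loc in ET.fromstring(sitemap_xml).findall("sm:url/sm:loc", ns):
            yield loc.text.strip()

    def crawl():
        for url in sitemap_urls(fetch(SITEMAP_URL)):
            page = fetch(url)
            print(url, len(page), "bytes")  # replace with your caching/indexing
            time.sleep(DELAY_SECONDS)       # throttle between requests

    if __name__ == "__main__":
        crawl()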

Adam Davis
+1  A: 

I guess Wikipedia has a decent reference on the topic. Obey those guidelines and, for courtesy, a bit more.

For example, I'd probably cap the request rate at one hit per second, or I'd risk launching an inadvertent DoS attack.
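
Concretely, and only as a sketch: Python's standard urllib.robotparser can read a Crawl-delay directive if the site declares one, so you can honor the site's own preferred pace and fall back to the one-hit-per-second figure only when none is given. The crawler name and site URL below are placeholders.

    import time
    import urllib.robotparser

    USER_AGENT = "MyCrawler/1.0"  # placeholder crawler name

    robots = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
    robots.read()

    # crawl_delay() returns None when robots.txt declares no Crawl-delay for this
    # agent, so fall back to one request per second in that case.
    delay = robots.crawl_delay(USER_AGENT) or 1.0

    def throttled(fetch, url):
        """Fetch a page, then sleep long enough to stay within the agreed rate."""
        page = fetch(url)
        time.sleep(delay)
        return page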

Henrik Paul