Is there a way to force a spider to slow down its spidering of a website? Anything that can be put in headers or robots.txt?
I thought I remembered reading something about this being possible but cannot find anything now.
If you're referring to Google, you can throttle the speed at which Google spiders your site by using your Google Webmaster account (Google Webmaster Tools).
There is also this, which you can put in robots.txt:
User-agent: *
Crawl-delay: 10
The crawl delay is specified as the number of seconds to wait between successive page fetches. Of course, like everything else in robots.txt, the crawler has to choose to respect it, so YMMV.
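If one particular crawler is the main offender, you can also give just that bot a longer delay while leaving everyone else alone. The bot name below is only a placeholder; use whatever User-agent string the bot actually reports:

User-agent: ExampleBot
Crawl-delay: 30

User-agent: *
Crawl-delay: 10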
I don't think robots.txt will do anything except allow or disallow. Most of the search engines will allow you to customize how they index your site.
If you have a specific agent that is causing issues, you might either block it specifically, or see if you can configure it.
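For example, you can ask one specific crawler to stay out entirely via robots.txt (the name here is hypothetical; substitute the User-agent string the problem bot actually sends):

User-agent: AnnoyingBot
Disallow: /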
Beyond using the Google Webmaster Tools for the Googlebot (see Robert Harvey's answer), Yahoo! and Bing support the nonstandard Crawl-delay directive in robots.txt:
http://en.wikipedia.org/wiki/Robots.txt#Nonstandard_extensions
When push comes to shove, however, a misbehaving bot that's slamming your site will just have to be blocked at a higher level (e.g. load balancer, router, caching proxy, whatever is appropriate for your architecture).
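If the block ends up living on the web server itself, something along these lines is one way to do it with Apache (a rough 2.2-style sketch; "BadBot" is a stand-in for whatever User-Agent string the offender actually sends):

# Tag requests whose User-Agent matches the offending bot, then refuse them
SetEnvIfNoCase User-Agent "BadBot" bad_bot
<Location "/">
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
</Location>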
Maybe you can save bandwidth by disallowing robot access to part of the server:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
Disallow: /css/
Disallow: /images/
This allows access to everything:
User-agent: *
Disallow:
See Throttling your web server for a solution using Perl. Randal Schwartz said that he survived a Slashdot attack using this solution.