views: 224
answers: 6
I run a small webserver, and lately it's been getting creamed by a search engine spider. What's the proper way to cool it down? Should I send it 5xx responses periodically? Is there a robots.txt setting I should be using? Or something else?

+6  A: 

Assuming that the spider is kind enough to respect robots.txt, you could restrict it from accessing your site with the following:

User-agent: *
Disallow: /

This will affect all spiders. Narrow it down by specifying the correct user-agent for the spider.
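
For example, to restrict only the offending crawler while leaving all other spiders unaffected, you could use something like the following (BadBot is a placeholder; substitute the spider's actual user-agent token):

User-agent: BadBot
Disallow: /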

If the crawler doesn't respect your robots.txt, you can block its IP at your firewall instead.
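
As a sketch, on a Linux host running iptables you could drop the crawler's traffic like this (203.0.113.50 is a placeholder address, not a real crawler IP):

# Drop all packets from the offending crawler's address (placeholder IP)
iptables -A INPUT -s 203.0.113.50 -j DROP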

EDIT: You can read more about robots.txt at robotstxt.org.

Simon Jensen
+2  A: 

Robots.txt should be your first port of call. The search bot should take note of these settings and stop hitting the pages that you deny access to. This is easily done by creating a file in the root of your website with the following syntax:

User-agent: *
Disallow: /

That syntax essentially says: all search bots (the * wildcard) are disallowed from crawling anything under /. More information at robotstxt.org

If this doesn't work, the next step is to ban the IP address if possible.

Ray Booysen
+1  A: 
User-agent: *
Disallow: /
John T
+1  A: 

You can also build a sitemap and register it with the offending bot. Search engines use the sitemap to determine which pages to crawl and how often. If your site is fully dynamic it may not help much, but if you have a lot of static pages it's a good way to tell spiders that nothing changes from day to day.
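
A minimal sketch of such a sitemap, assuming a static page on a hypothetical example.com and using the standard sitemaps.org format:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- A static page that rarely changes; changefreq hints that the crawler need not revisit often -->
  <url>
    <loc>http://www.example.com/about.html</loc>
    <lastmod>2008-09-01</lastmod>
    <changefreq>yearly</changefreq>
  </url>
</urlset>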

jwmiller5
A: 

robots.txt should be your first choice. However, if the bot misbehaves and you don't have control over the firewall, you could set up a .htaccess restriction to ban it by IP.
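
A minimal .htaccess sketch along those lines, assuming Apache 2.2-style access control and a placeholder IP address:

# Deny a single misbehaving crawler by IP (placeholder address), allow everyone else
Order Allow,Deny
Allow from all
Deny from 203.0.113.50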

Chris Nava
+1  A: 

If it's ignoring robots.txt, the next best thing is to ban it by its user-agent string. Banning the IP alone won't do much good, as most spiders these days are distributed across many servers.
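
One way to do that on Apache with mod_rewrite (assuming the module is enabled; BadBot is a placeholder for the spider's actual user-agent token):

# Return 403 Forbidden to any client whose User-Agent contains "BadBot" (case-insensitive)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} BadBot [NC]
RewriteRule .* - [F,L]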

Ant P.