tags:

views: 110

answers: 4

Is it possible to write code to detect if someone is spidering a website's content?

+3  A: 

A good spider:

  • reads the robots.txt
  • has a proper user-agent
  • will query faster than an average user

But I don't think you can reliably detect whether a request comes from a browser or from a spider.
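
A rough sketch of the "queries faster than an average user" idea in Python; the window length, request limit, and in-memory bookkeeping are assumptions for illustration, not part of the answer:

    # Minimal sketch: flag clients that request pages faster than a typical human would.
    # WINDOW_SECONDS and MAX_REQUESTS are assumed values, not recommendations.
    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 10   # look at the last 10 seconds of traffic
    MAX_REQUESTS = 20     # more than this in the window looks automated

    recent_requests = defaultdict(deque)  # client IP -> timestamps of recent requests

    def looks_like_spider(client_ip):
        now = time.time()
        timestamps = recent_requests[client_ip]
        timestamps.append(now)
        # Drop requests that have fallen out of the time window.
        while timestamps and now - timestamps[0] > WINDOW_SECONDS:
            timestamps.popleft()
        return len(timestamps) > MAX_REQUESTS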

Sebastian Sedlak
+1  A: 

You can try using the user agent string to identify bots.

Different bots seem to have different user agent strings:

http://www.useragentstring.com/pages/useragentstring.php

However, the user agent string can be easily spoofed.
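
A minimal sketch of that check in Python; the token list below is a small, incomplete sample of bot identifiers, and as noted the header can be spoofed:

    # Minimal sketch: match the User-Agent header against known bot tokens.
    # The tuple below is an illustrative, incomplete sample only.
    KNOWN_BOT_TOKENS = ("googlebot", "bingbot", "slurp", "baiduspider", "yandexbot")

    def is_known_bot(user_agent):
        if not user_agent:
            return True  # treating a missing User-Agent as suspicious is an assumption here
        ua = user_agent.lower()
        return any(token in ua for token in KNOWN_BOT_TOKENS)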

maxyfc
+1  A: 

You can check against a list of User-Agent strings that common bots use. You can also use some form of rate detection: a very high rate of requests probably means a spider (or someone leeching your entire site).

There might also be lists of IP addresses used by common bots, but a foolproof detection system is most likely impossible.

You could create a link on your pages that a real visitor would never click and flag anyone who does follow it as a spider. Some curious people will click the link anyway, so it is not perfect either.
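
A minimal sketch of such a trap in Python; the hidden URL, the inline style, and the in-memory set of flagged IPs are made-up details for illustration:

    # Minimal sketch of a honeypot link: hidden from humans, but visible to
    # crawlers in the HTML. The path and the in-memory set are illustrative choices.
    flagged_ips = set()

    HONEYPOT_PATH = "/do-not-follow"
    HONEYPOT_HTML = '<a href="/do-not-follow" style="display:none">ignore me</a>'

    def is_flagged_spider(path, client_ip):
        # Anything that follows the hidden link gets flagged as a spider.
        if path == HONEYPOT_PATH:
            flagged_ips.add(client_ip)
        return client_ip in flagged_ips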

Gerco Dries
+1  A: 

If the spider is nice, you can detect it through its user-agent, using a list of existing user agents like this. But a nice web spider usually also follows the robots.txt convention.
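
As a rough illustration of the robots.txt point: browsers almost never request robots.txt, so you could treat fetching it as a hint that the client is a crawler. The tracking set below is an assumed detail for the sketch:

    # Minimal sketch: remember which clients asked for robots.txt.
    # Normal browsers rarely do, so such clients are likely crawlers.
    clients_that_fetched_robots = set()

    def note_request(path, client_ip):
        if path == "/robots.txt":
            clients_that_fetched_robots.add(client_ip)

    def probably_a_nice_spider(client_ip):
        return client_ip in clients_that_fetched_robots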

Robots that ignore the robots.txt file and spoof their user-agent most likely also use other means to hide that they are spiders.

Caotic