Is it possible to write code to detect whether someone is spidering your website's content?
A good spider:
- reads robots.txt
- sends a proper user-agent header
- queries faster than an average user
But reliably telling whether a given client is a browser or a spider is not possible, I think.
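A minimal sketch of checking the first trait offline: it flags clients that request many pages but never fetch robots.txt. The `(ip, path)` record format and the threshold are assumptions; adapt them to however you parse your access log.

```python
from collections import defaultdict

PAGE_THRESHOLD = 50  # hypothetical: how many pages before we care about a client

def clients_ignoring_robots_txt(records):
    """records: iterable of (ip, path) tuples, e.g. parsed from an access log."""
    fetched_robots = set()
    page_counts = defaultdict(int)
    for ip, path in records:
        if path == "/robots.txt":
            fetched_robots.add(ip)
        else:
            page_counts[ip] += 1
    # Heavy clients that never looked at robots.txt are suspicious.
    return {ip for ip, count in page_counts.items()
            if count >= PAGE_THRESHOLD and ip not in fetched_robots}
```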
You can try using the user-agent string to identify bots.
Different bots seem to have different user agent strings:
http://www.useragentstring.com/pages/useragentstring.php
However, the user-agent string can easily be spoofed.
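A minimal sketch of that check, assuming you have the request's User-Agent header as a string. The marker substrings below are just common examples; a real list (e.g. from useragentstring.com) is much longer, and as noted the header can be spoofed.

```python
KNOWN_BOT_MARKERS = ("googlebot", "bingbot", "slurp", "duckduckbot",
                     "baiduspider", "yandexbot", "crawler", "spider", "bot")

def looks_like_bot(user_agent):
    """True if the User-Agent contains a substring commonly used by bots."""
    ua = (user_agent or "").lower()
    return any(marker in ua for marker in KNOWN_BOT_MARKERS)

# Example:
# looks_like_bot("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
# -> True
```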
You can check against a list of User-Agent strings that common bots use. You can also apply some form of rate detection: a very high request rate probably indicates a spider (or someone leeching your entire site).
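A minimal sliding-window sketch of that rate detection, assuming you call `record_hit(ip)` once per request. The window and threshold are made-up numbers; tune them to what an average user actually does on your site.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10
MAX_REQUESTS_PER_WINDOW = 30  # hypothetical: more than a human would plausibly do

_hits = defaultdict(deque)  # ip -> timestamps of that client's recent requests

def record_hit(ip, now=None):
    """Record one request; return True if the client exceeds the rate threshold."""
    now = time.time() if now is None else now
    recent = _hits[ip]
    recent.append(now)
    # Drop timestamps that fell out of the window.
    while recent and now - recent[0] > WINDOW_SECONDS:
        recent.popleft()
    return len(recent) > MAX_REQUESTS_PER_WINDOW
```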
There might also be lists of IP addresses used by common bots, but a foolproof detection system is most likely impossible.
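A related check that avoids maintaining IP lists by hand is reverse-DNS verification, which Google documents for Googlebot: reverse-resolve the client IP, check the hostname suffix, then forward-resolve the hostname and make sure it points back to the same IP. This is a sketch of that idea only; it costs DNS lookups, so cache the result in real use.

```python
import socket

GOOGLE_HOST_SUFFIXES = (".googlebot.com", ".google.com")

def is_verified_googlebot(ip):
    """True if ip reverse-resolves to a Google hostname that resolves back to it."""
    try:
        hostname = socket.gethostbyaddr(ip)[0]               # reverse DNS lookup
        if not hostname.endswith(GOOGLE_HOST_SUFFIXES):
            return False
        forward_ips = socket.gethostbyname_ex(hostname)[2]   # forward lookup
        return ip in forward_ips                             # must point back to the same IP
    except OSError:                                          # lookup failed or no PTR record
        return False
```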
You could add a link to your pages that a real visitor would never click, and flag anyone who does follow it as a spider. You will still get a few people clicking it anyway; curiosity cannot be avoided entirely.
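A minimal sketch of that trap as a Flask app (Flask and the `/trap-page` path are assumptions; any framework works). The link is hidden with CSS and disallowed in robots.txt, so humans and polite spiders never request it, and whoever does gets flagged.

```python
from flask import Flask, request

app = Flask(__name__)
flagged_ips = set()  # in-memory for the sketch; use a persistent store in practice

@app.route("/")
def index():
    # The trap link is present in the HTML but invisible to human visitors.
    return 'Welcome!<a href="/trap-page" style="display:none">do not follow</a>'

@app.route("/robots.txt")
def robots():
    # Polite spiders are told to stay away, so only impolite ones hit the trap.
    return "User-agent: *\nDisallow: /trap-page\n", 200, {"Content-Type": "text/plain"}

@app.route("/trap-page")
def trap():
    flagged_ips.add(request.remote_addr)  # flag whoever followed the hidden link
    return "", 204
```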
If the spider is nice, you can detect it through its user-agent, using a list of known user-agent strings like the one linked above. But a nice web spider usually also follows the robots.txt conventions.
Robots that ignore the robots.txt file and spoof their user-agent most likely also use other means to hide that they are a spider.