I would like to detect (on the server side) which requests are from bots. I don't care about malicious bots at this point, just the ones that are playing nice. I've seen a few approaches that mostly involve matching the user agent string against keywords like 'bot'. But that seems awkward, incomplete, and unmaintainable. So does anyone have any more solid approaches? If not, do you have any resources you use to keep up to date with all the friendly user agents?

If you're curious: I'm not trying to do anything against any search engine policy. We have a section of the site where a user is randomly presented with one of several slightly different versions of a page. However, if a web crawler is detected, we'd always serve the same version so that the index is consistent.

Also, I'm using Java, but I'd imagine the approach would be similar for any server-side technology. Thanks!

+2  A: 

Any visitor whose entry page is /robots.txt is probably a bot.
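
In Java, a rough sketch of that heuristic is a servlet filter that remembers which client IPs have requested /robots.txt. The class name, the "likelyBot" attribute, and the unbounded in-memory set are all illustrative assumptions; tracking by IP is crude, but most crawlers don't carry session cookies, so a session-based approach wouldn't see their later requests:

import java.io.IOException;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;

// Sketch: remember which client IPs have fetched /robots.txt and expose
// that as a request attribute for the rest of the app to consult.
public class RobotsTxtFilter implements Filter {
    private final Set<String> crawlerIps =
        Collections.synchronizedSet(new HashSet<String>());

    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        if ("/robots.txt".equals(request.getRequestURI())) {
            crawlerIps.add(request.getRemoteAddr());
        }
        // "likelyBot" is an illustrative attribute name, not a standard one.
        request.setAttribute("likelyBot",
            crawlerIps.contains(request.getRemoteAddr()));
        chain.doFilter(req, res);
    }

    public void init(FilterConfig config) {}
    public void destroy() {}
}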

Sparr
Or, to be less strict, a visitor who requests robots.txt at all is probably a bot, although there are a few Firefox plugins that grab it while a human is browsing.
Sparr
Any bot that goes there is probably a well-behaved, respectable bot, the kind you might want visiting your site :-)
Hightechrider
+3  A: 

You can find a very thorough database of known "good" web crawlers in the robotstxt.org Robots Database. Using this data would be far more effective than just matching 'bot' in the user agent.
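
To illustrate in Java (the asker's language): you might export the crawler names from that database to a plain text file, one signature per line, and check the User-Agent header against it. The file format and class below are my assumptions, not something robotstxt.org ships:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Sketch: match the User-Agent header against a locally maintained
// list of crawler signatures exported from the robotstxt.org database.
public class KnownBots {
    private final List<String> botSignatures = new ArrayList<String>();

    public KnownBots(String signatureFile) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(signatureFile));
        String line;
        while ((line = in.readLine()) != null) {
            line = line.trim().toLowerCase();
            if (line.length() > 0) {
                botSignatures.add(line);
            }
        }
        in.close();
    }

    // True if the user agent contains any known crawler signature.
    // Usage: knownBots.isKnownBot(request.getHeader("User-Agent"))
    public boolean isKnownBot(String userAgent) {
        if (userAgent == null) return false;
        String ua = userAgent.toLowerCase();
        for (String sig : botSignatures) {
            if (ua.contains(sig)) return true;
        }
        return false;
    }
}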

Sparr
+3  A: 

One suggestion is to place an empty anchor on your page that only a bot would follow. Normal users won't see the link, so only spiders and bots will follow it. For example, an empty anchor tag that points to a subfolder would record a GET request in your logs...

<a href="dontfollowme.aspx"></a>

Many people use this method while running a honeypot to catch malicious bots that don't respect the robots.txt file. I use the empty anchor method in an ASP.NET honeypot solution I wrote to trap and block those creepy crawlers...
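
In Java, a minimal analogue of that trap could be a servlet mapped to the hidden URL. This is a sketch only; the logging and the 404 response are placeholder choices for whatever trap logic you actually want:

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Sketch: map this servlet to the hidden honeypot URL (e.g. /dontfollowme)
// in web.xml. Humans never see the empty link, so any hit is a crawler.
public class HoneypotServlet extends HttpServlet {
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        // Record who tripped the trap; you could add the IP to a block list here.
        log("Honeypot hit from " + req.getRemoteAddr()
            + " (" + req.getHeader("User-Agent") + ")");
        resp.setStatus(HttpServletResponse.SC_NOT_FOUND);
    }
}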

Dscoduc
Just out of curiosity, this made me wonder whether that might hurt accessibility, e.g. whether someone could accidentally focus the anchor with the Tab key and then hit Return to follow it anyway. Apparently not (see http://jsbin.com/efipa/ for a quick test), but of course I've only tested with a normal browser.
Arjan
You need to be a little careful with techniques like this, so that you don't get your site blacklisted for using black-hat SEO techniques.
Hightechrider
+1  A: 

Something quick and dirty like this might be a good start:

return if request.user_agent =~ /googlebot|msnbot|baidu|curl|wget|Mediapartners-Google|slurp|ia_archiver|Gigabot|libwww-perl|lwp-trivial/i

Note: this is Rails code, but the regex is generally applicable.
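
Since the question mentions Java, here's a rough equivalent of the same check there (a sketch using java.util.regex, with the keyword list copied from the regex above):

import java.util.regex.Pattern;

// Sketch: the same case-insensitive keyword check in Java.
public class BotCheck {
    private static final Pattern BOT_PATTERN = Pattern.compile(
        "googlebot|msnbot|baidu|curl|wget|mediapartners-google|slurp"
        + "|ia_archiver|gigabot|libwww-perl|lwp-trivial",
        Pattern.CASE_INSENSITIVE);

    // Usage: BotCheck.isBot(request.getHeader("User-Agent"))
    public static boolean isBot(String userAgent) {
        return userAgent != null && BOT_PATTERN.matcher(userAgent).find();
    }
}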

Brian Armstrong