Detecting if your site is being accessed by a robot

+2 Q:

I have some geo-targeting code which I want to behave in a particular way if the site is being spidered by a robot, e.g. Google.

Is there any way to infer this?

+3 A:

Presenting different content to search engine crawlers and human visitors - called cloaking - is a risky thing, and can be punished by the search engine if detected.

That said, check out this SO answer with several links to well-maintained "bot lists". You would have to parse the USER_AGENT string and compare it against such a bot list.
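As a rough illustration of the user-agent comparison, here is a minimal Python sketch. The token list is an assumption made up for the example (three well-known crawler names), not one of the maintained bot lists mentioned above, and `looks_like_bot` is a hypothetical helper name.

```python
# Illustrative tokens only; a maintained bot list would be far longer.
KNOWN_BOT_TOKENS = ("googlebot", "bingbot", "slurp")


def looks_like_bot(user_agent):
    """Return True if the User-Agent header contains a known crawler token."""
    # Treat a missing header as "not a bot" and compare case-insensitively.
    ua = (user_agent or "").lower()
    return any(token in ua for token in KNOWN_BOT_TOKENS)
```

Bear in mind this only tells you what the client claims to be; the header is trivial to spoof.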

Pekka 2010-02-15 17:10:34

+1 A:

You can check this via the user-agent property. For more info on user-agent strings, check here: http://www.user-agents.org/ and look for the records marked with type "R = Robot, crawler, spider". But this is not guaranteed: the user-agent can be changed or spoofed, so it is not 100% reliable.

anthares 2010-02-15 17:11:52

+1 A:

You can do it by checking the user-agent or the IP. It may be preferable to use the latter, as it's not unknown for other, less reputable bots to spoof the user-agent of the big guys. Even for Google et al., the IPs tend to be in narrow ranges, so detecting on IP shouldn't require compiling vast lists.
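A minimal sketch of the IP-range check in Python, using the standard `ipaddress` module. The network below is an assumption chosen only because it contains the Googlebot address quoted elsewhere in this thread; take the real ranges from whatever the search engine publishes. `ip_in_crawler_range` is a hypothetical helper name.

```python
import ipaddress

# Assumed range for illustration only; replace with ranges the crawler operator publishes.
CRAWLER_NETWORKS = [ipaddress.ip_network("66.249.64.0/19")]


def ip_in_crawler_range(ip):
    """Return True if the requesting IP falls inside a listed crawler network."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        # Malformed address strings are never treated as a crawler.
        return False
    return any(addr in network for network in CRAWLER_NETWORKS)
```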

Richard 2010-02-15 17:33:37

If you are only interested in the well set up reputable bots e.g. Google, Yahoo, MSN/Live/Bing/whatever-it-is-today, Ask etc then you can use round trip DNS checking.

1) Check for a known user agent (look for a known substring such as googlebot)
e.g. Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

2) Do a reverse DNS lookup for the requesting IP and check that it resolves to a host in a reasonable domain.
e.g. rDNS of 66.249.71.202 is crawl-66-249-71-202.googlebot.com (so we are satisfied that it claims to come from googlebot.com)

3) On its own, step 2 can be faked, so now do a forward DNS lookup (A record) for the hostname returned in step 2 and ensure it resolves back to the original requesting IP.
e.g. the forward DNS for the above is
crawl-66-249-71-202.googlebot.com. A 66.249.71.202

66.249.71.202 was the requesting IP address so this is a valid googlebot.
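The three steps above could be sketched in Python roughly as follows. This is only an illustration under some assumptions: `is_verified_crawler` and `TRUSTED_SUFFIXES` are hypothetical names, only the googlebot.com domain from the example is listed, and `socket.gethostbyname_ex` handles IPv4 only. The lookup functions are passed as parameters so the logic can be exercised without touching the network.

```python
import socket

# Only the domain from the example above; other crawlers would need their own entries.
TRUSTED_SUFFIXES = (".googlebot.com",)


def is_verified_crawler(ip, user_agent,
                        reverse_lookup=socket.gethostbyaddr,
                        forward_lookup=socket.gethostbyname_ex):
    """Round-trip DNS check: user agent, then reverse DNS, then forward DNS."""
    # Step 1: cheap filter on the claimed user agent.
    if "googlebot" not in (user_agent or "").lower():
        return False
    # Step 2: reverse DNS must give a hostname in a trusted domain.
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:
        return False
    if not hostname.lower().rstrip(".").endswith(TRUSTED_SUFFIXES):
        return False
    # Step 3: forward DNS for that hostname must include the requesting IP.
    try:
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False
    return ip in addresses
```

Since each check costs up to two DNS lookups, it is probably worth caching the result per IP rather than repeating it on every request.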

status203 2010-02-17 10:36:02
