You can go by either IP addresses or the 'User-Agent' string that the bot or web browser sends you (a sketch of the IP-based approach is at the end of this answer).
When Googlebot (or most other well-behaved robots) visits your website, it sends a User-Agent header, available in PHP as $_SERVER['HTTP_USER_AGENT'], which identifies what it is. Some examples:
Googlebot/2.1 (+http://www.google.com/bot.html)
NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html)
Baiduspider+(+http://www.baidu.com/search/spider_jp.html)
Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/531.4 (KHTML, like Gecko)
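If you want to see what your own visitors are sending, here is a minimal sketch that appends each user-agent to a log file (the 'ua.log' filename is just an example):

// Append each visitor's user-agent string to a log file for later inspection.
// 'ua.log' is an arbitrary example filename.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '(no user-agent sent)';
file_put_contents('ua.log', $ua . PHP_EOL, FILE_APPEND);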
You can find many more examples in the various online user-agent string databases.
You can then use PHP to examine the user-agent string and decide whether the visitor is a search engine. I often use something like this:
// Case-insensitive substrings that identify common crawlers
$searchengines = array(
    'Googlebot',
    'Slurp',
    'search.msn.com',
    'nutch',
    'simpy',
    'bot',
    'ASPSeek',
    'crawler',
    'msnbot',
    'Libwww-perl',
    'FAST',
    'Baidu',
);

$is_se = false;
// Check once that a user-agent was sent at all, then look for each
// search-engine substring case-insensitively.
if (!empty($_SERVER['HTTP_USER_AGENT'])) {
    foreach ($searchengines as $searchengine) {
        if (stripos($_SERVER['HTTP_USER_AGENT'], $searchengine) !== false) {
            $is_se = true;
            break;
        }
    }
}
if ($is_se) { print("It's a search engine!"); }
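One handy use for this is keeping bots out of your own page-view statistics. A sketch, where increment_page_views() stands in for whatever counting code you already have:

// Only count the view if the visitor doesn't look like a bot.
// increment_page_views() is a hypothetical helper, not a built-in.
if (!$is_se) {
    increment_page_views();
}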
Remember that no detection method (Google Analytics, any other statistics package, or this one) is going to be 100% accurate. Some web browsers let the user set a custom user-agent string, and some misbehaving crawlers don't send one at all. This method should still be effective for 95%+ of crawlers/visitors, though.
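As for the IP-address approach mentioned at the top: user-agents can be faked, so for Googlebot specifically Google recommends verifying by reverse DNS instead of maintaining IP lists. A sketch of that check (verify_googlebot() is my own name for it, not a built-in):

// Reverse-DNS verification: look up the hostname for the IP, check that
// it belongs to Google, then resolve the hostname forward again to make
// sure it points back to the same IP (anyone can fake a reverse record).
function verify_googlebot($ip) {
    $host = gethostbyaddr($ip);
    if ($host === false || !preg_match('/\.(googlebot|google)\.com$/i', $host)) {
        return false;  // reverse record doesn't belong to Google
    }
    return gethostbyname($host) === $ip;  // forward-confirm the hostname
}

if (verify_googlebot($_SERVER['REMOTE_ADDR'])) {
    print('Verified Googlebot');
}

Note that DNS lookups are slow, so you wouldn't want to run this on every request; cache the result per IP.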