Is there any way to detect search engines or crawlers on my site? In the phpBB admin panel I have seen that you can view and allow search engines, and also see the last visit of each bot (such as Googlebot).

Is there a PHP script for this? I don't mean Google Analytics or a similar application. I need to implement this for my blog site, and I think there must be some way to find out.

A: 

Use Google Analytics. Copy and paste the JavaScript snippet onto your site, and you're set. It's by far the best free tool of its kind.

Matt Grande
Yes, I have already implemented that, but I am talking about a different kind of script: I want to know how to track "bots". I think you didn't understand my question.
coderex
A: 

Alternatives to Google Analytics which allow you to keep your data:

Piwik is implemented in PHP and operates in the same way as GA. The other two are Web server log analyzers.

oggy
A: 

Google Analytics will let you see the crawl stats for your site.

Galwegian
+3  A: 

You can go by either IP addresses or the 'User-Agent' string that the bot or web browser sends you.

When Googlebot (or most other well-behaved robots) visits your website, it sends a User-Agent header that identifies it, which PHP exposes as $_SERVER['HTTP_USER_AGENT']. Some examples are:

Googlebot/2.1 (+http://www.google.com/bot.html)

NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html

Baiduspider+(+http://www.baidu.com/search/spider_jp.html)

Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/531.4 (KHTML, like Gecko)

You can find many more examples at these websites: link text link text

You could then use PHP to examine those user-agent strings and determine if the user is a search engine or not. I use something like this often:

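// Substrings that identify common crawlers (matched case-insensitively).
$searchengines = array(
    'Googlebot',
    'Slurp',
    'search.msn.com',
    'nutch',
    'simpy',
    'bot',
    'ASPSeek',
    'crawler',
    'msnbot',
    'Libwww-perl',
    'FAST',
    'Baidu',
);

// The header may be missing entirely, so fall back to an empty string.
$user_agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

$is_se = false;
foreach ($searchengines as $searchengine) {
    // stripos() is the case-insensitive version of strpos().
    if ($user_agent !== '' && false !== stripos($user_agent, $searchengine)) {
        $is_se = true;
        break;
    }
}

if ($is_se) {
    print("It's a search engine!");
}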

Remember that no detection method (Google Analytics, another statistics package, or anything else) is going to be 100% accurate. Some web browsers let you set a custom user-agent string, and some misbehaving web crawlers may not send a user-agent string at all. This method is probably effective for 95%+ of crawlers and visitors, though.
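
Since the question also asks about seeing a bot's last visit, here is a minimal sketch of how the user-agent check could be wrapped in a function and combined with a small log. The function names detect_bot() and record_bot_visit(), the bot_visits.json file name, and the shortened signature list are all assumptions made for illustration; a database table would work equally well.

```php
<?php
// Hypothetical helpers; the names and the JSON file are illustrative only.

/**
 * Return the first signature found in the user-agent string, or null.
 */
function detect_bot(string $userAgent, array $signatures): ?string
{
    if ($userAgent === '') {
        return null; // no User-Agent header, so this method cannot classify it
    }
    foreach ($signatures as $signature) {
        // Case-insensitive substring match
        if (stripos($userAgent, $signature) !== false) {
            return $signature;
        }
    }
    return null;
}

/**
 * Store the time of the latest visit per bot in a small JSON file
 * and return the updated list.
 */
function record_bot_visit(string $file, string $bot, int $time): array
{
    $visits = [];
    if (is_file($file)) {
        $decoded = json_decode((string) file_get_contents($file), true);
        if (is_array($decoded)) {
            $visits = $decoded;
        }
    }
    $visits[$bot] = date('c', $time); // ISO 8601 timestamp
    file_put_contents($file, json_encode($visits, JSON_PRETTY_PRINT), LOCK_EX);
    return $visits;
}

// Shortened copy of the signature list; specific names come before 'bot'.
$signatures = ['Googlebot', 'Slurp', 'msnbot', 'Baidu', 'crawler', 'bot'];
$userAgent  = $_SERVER['HTTP_USER_AGENT'] ?? '';

$bot = detect_bot($userAgent, $signatures);
if ($bot !== null) {
    record_bot_visit(__DIR__ . '/bot_visits.json', $bot, time());
}
```

Putting the specific names ahead of the generic 'bot' entry means the log records "Googlebot" rather than just "bot" when both would match.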

Keith Palmer
I think this is what I was looking for.
coderex
+2  A: 
  1. You can try to detect them using their user-agent string. A list of them can be found here: http://www.botsvsbrowsers.com/

    Search engines tend to use the words crawler and robot.

  2. Search engines are almost the only visitors that request robots.txt.

  3. Some IP addresses are known to belong to bots such as Googlebot.
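
For point 3, rather than keeping a list of IPs by hand, one common approach is a reverse DNS lookup followed by a confirming forward lookup. The sketch below is only an illustration under some assumptions: the function name is_verified_googlebot() is made up, the accepted host suffixes (googlebot.com and google.com) are an assumption to check against Google's current documentation, and gethostbynamel() only handles IPv4. The two lookups are passed in as optional callables so the logic can be tried without network access.

```php
<?php
/**
 * Hypothetical helper: does this IP resolve to a Googlebot host name,
 * and does that host name resolve back to the same IP?
 * $reverse and $forward default to PHP's gethostbyaddr() and gethostbynamel().
 */
function is_verified_googlebot(string $ip, ?callable $reverse = null, ?callable $forward = null): bool
{
    $reverse = $reverse ?? 'gethostbyaddr';
    $forward = $forward ?? 'gethostbynamel';

    $host = $reverse($ip);
    // gethostbyaddr() returns the unmodified IP on failure, false on malformed input.
    if (!is_string($host) || $host === $ip) {
        return false;
    }

    // Assumed suffixes; the leading dot and the end anchor reject look-alike hosts.
    if (!preg_match('/\.(googlebot|google)\.com$/i', $host)) {
        return false;
    }

    // The forward lookup must lead back to the original IP, otherwise
    // anyone controlling their own reverse DNS could fake the host name.
    $addresses = $forward($host);
    return is_array($addresses) && in_array($ip, $addresses, true);
}

// Usage on a live request (performs real DNS lookups, so cache the result):
// $isGooglebot = is_verified_googlebot($_SERVER['REMOTE_ADDR']);
```

DNS lookups are slow, so it probably makes sense to run this only when the user-agent string already claims to be Googlebot.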

Georg