tags:
views: 1687
answers: 4

How can one detect search engine bots using PHP?

A: 

You could analyse the user agent ($_SERVER['HTTP_USER_AGENT']) or compare the client’s IP address ($_SERVER['REMOTE_ADDR']) with a list of IP addresses of search engine bots.
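
For illustration, a minimal sketch of both ideas; the reverse-DNS test here stands in for maintaining a literal IP list (verified Googlebot requests resolve to hosts under googlebot.com):

$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$ip = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '';

// 1) User agent check
$isBot = stripos($agent, 'googlebot') !== false;

// 2) Reverse DNS check, as a stand-in for a static IP list
if (!$isBot && $ip !== '') {
    $host = gethostbyaddr($ip);
    if ($host !== false && preg_match('/\.googlebot\.com$/i', $host)) {
        $isBot = true;
    }
}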

Gumbo
+5  A: 

Here's a Search Engine Directory of Spider names

Then you can check $_SERVER['HTTP_USER_AGENT'] to see whether the agent is one of those spiders:

if (stripos($_SERVER['HTTP_USER_AGENT'], 'googlebot') !== false)
{
    // what to do
}
Ólafur Waage
Will this work fine? if ((eregi("yahoo",$this->USER_AGENT)) $this->Type = "robot"; }
terrific
+5  A: 

Check $_SERVER['HTTP_USER_AGENT'] against some of the strings listed here:

http://www.useragentstring.com/pages/All/

Or more specifically for crawlers:

http://www.useragentstring.com/pages/Crawlerlist/

If you want to, say, log the number of visits by the most common search engine crawlers, you could use:

$interestingCrawlers = array('google', 'yahoo');
$pattern = '/(' . implode('|', $interestingCrawlers) . ')/i'; // the 'i' modifier belongs in the pattern, not in preg_match's flags
$matches = array();
$numMatches = preg_match($pattern, $_SERVER['HTTP_USER_AGENT'], $matches);
if ($numMatches > 0) // found a match
{
  // $matches[1] contains the first matched crawler name, 'google' or 'yahoo'
}
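
To actually count those visits, you could append each hit to a log file; a minimal sketch (the crawler_visits.log name is just an assumption for illustration):

if ($numMatches > 0) {
    // Record one timestamped line per crawler visit; tallying the file later
    // gives the visit counts per crawler.
    file_put_contents('crawler_visits.log',
        date('Y-m-d H:i:s') . ' ' . $matches[1] . "\n",
        FILE_APPEND);
}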
Jukka Dahlbom
+1  A: 
<?php // IP CLOAK HOOK
if (CLOAKING_LEVEL != 4) {
    // Refresh the spider list once a day. FILE_BOTS and CLOAKING_LEVEL are
    // constants defined elsewhere in YACG; fetch() is its HTTP helper.
    $lastupdated = date("Ymd", filemtime(FILE_BOTS));
    if ($lastupdated != date("Ymd")) {
        $lists = array(
            'http://labs.getyacg.com/spiders/google.txt',
            'http://labs.getyacg.com/spiders/inktomi.txt',
            'http://labs.getyacg.com/spiders/lycos.txt',
            'http://labs.getyacg.com/spiders/msn.txt',
            'http://labs.getyacg.com/spiders/altavista.txt',
            'http://labs.getyacg.com/spiders/askjeeves.txt',
            'http://labs.getyacg.com/spiders/wisenut.txt',
        );
        $opt = '';
        foreach ($lists as $list) {
            $opt .= fetch($list);
        }
        // Collapse blank lines before writing the merged list to disk
        $opt = preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", "\n", $opt);
        $fp = fopen(FILE_BOTS, "w");
        fwrite($fp, $opt);
        fclose($fp);
    }
    $ip = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '';
    $ref = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
    $agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    $host = strtolower(gethostbyaddr($ip));
    $file = implode(" ", file(FILE_BOTS));
    $exp = explode(".", $ip);
    $class = $exp[0] . '.' . $exp[1] . '.' . $exp[2] . '.'; // class C prefix of the visitor's IP
    $threshold = CLOAKING_LEVEL;
    $cloak = 0;
    // A reverse-DNS host can only match one spider, so these tests must be OR, not AND
    if (stristr($host, "googlebot") || stristr($host, "inktomi") || stristr($host, "msn")) {
        $cloak++;
    }
    if (stristr($file, $class)) {
        $cloak++;
    }
    if (stristr($file, $agent)) {
        $cloak++;
    }
    // Spiders do not normally send a referer, so a referer suggests a human visitor
    if (strlen($ref) > 0) {
        $cloak = 0;
    }

    if ($cloak >= $threshold) {
        $cloakdirective = 1;
    } else {
        $cloakdirective = 0;
    }
}
?>

That would be the ideal way to cloak for spiders. It's from an open source script called YACG: http://getyacg.com

Needs a bit of work, but definitely the way to go.
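
A hypothetical way to act on $cloakdirective afterwards (the include file names are made up for illustration):

if ($cloakdirective == 1) {
    include 'content_for_spiders.php'; // hypothetical spider-facing version
} else {
    include 'content_for_humans.php'; // hypothetical human-facing version
}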

L. Cosio