AFAIK, $_SERVER['REMOTE_HOST'] should end with "google.com" or "yahoo.com".

But is that the most reliable method?

Is there any other way?

+7  A: 

You identify search engines by user agent and IP address. More info can be found in How to identify search engine spiders and webbots. It's also worth noting this list. You shouldn't treat user agents (or even remote hosts) as necessarily definitive, however. A user agent is really nothing more than what the other end tells you it is, and it is of course free to tell you anything. It's trivial to write code to pretend to be Googlebot.

In PHP, this means looking at $_SERVER['HTTP_USER_AGENT'] and $_SERVER['REMOTE_HOST'].
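One practical wrinkle: $_SERVER['REMOTE_HOST'] is only populated when the web server itself performs reverse lookups (for example, Apache with HostnameLookups On). A rough sketch of a fallback follows, assuming you are willing to pay for a DNS lookup per request. The helper name `client_hostname` and its injectable `$lookup` parameter are made up for illustration and are not part of PHP.

```php
<?php
// Hypothetical helper: returns the client's hostname in lower case,
// or null when no hostname can be determined.
function client_hostname(array $server, ?callable $lookup = null): ?string
{
    // REMOTE_HOST is only set when the web server does the reverse lookup itself.
    if (!empty($server['REMOTE_HOST'])) {
        return strtolower($server['REMOTE_HOST']);
    }
    if (empty($server['REMOTE_ADDR'])) {
        return null;
    }

    // Fall back to PHP's own reverse lookup (injectable so it can be stubbed).
    $lookup = $lookup ?? 'gethostbyaddr';
    $host = $lookup($server['REMOTE_ADDR']);

    // gethostbyaddr() returns the unmodified IP when the lookup fails,
    // and false on malformed input.
    if ($host === false || $host === $server['REMOTE_ADDR']) {
        return null;
    }
    return strtolower($host);
}

$host = client_hostname($_SERVER);
```

A hostname obtained this way is still only a hint, since whoever controls the reverse DNS for an IP range can put any name in it.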

There are a lot of search engines but honestly it's only the big few you really care about generally speaking. Google and Yahoo together have almost all of the market. But of course it depends on what you're trying to achieve.

Note: be very careful of treating search engines differently to normal users (like the "evil hyphen site" as Joel put it) when it comes to content. In particularly egregious cases, this could get your site removed from that search engine. Even if that doesn't happen you will probably put some users off who go to a site expecting something. If they're then presented with a "Please register to see this article" box instead, well, gratz on your high bounce rate.

cletus
but good luck with this, not many crawlers play nice with user-agent
annakata
But Google and Yahoo do.
Peter Stuifzand
User agent strings can be spoofed.
Gumbo
"gratz on your high bounce rate" I lol'd :P Sound advice though re:googlebanning
annakata
+4  A: 

You are probably better off using $_SERVER['HTTP_USER_AGENT'] and looking for Googlebot or Yahoo! Slurp.
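A minimal sketch of that check, assuming a case-insensitive substring match is good enough. The function name `crawler_from_user_agent` and the short labels it returns are invented for this example, and the header can be spoofed, so treat the result as a hint only.

```php
<?php
// Hypothetical helper: maps a User-Agent header to a crawler label, or null.
function crawler_from_user_agent(string $userAgent): ?string
{
    // Substrings mentioned in this thread; extend the list as needed.
    $signatures = [
        'Googlebot'    => 'google',
        'Yahoo! Slurp' => 'yahoo',
    ];
    foreach ($signatures as $needle => $name) {
        // stripos() does a case-insensitive search and returns false on no match.
        if (stripos($userAgent, $needle) !== false) {
            return $name;
        }
    }
    return null;
}

// The header may be absent entirely, so default to an empty string.
$crawler = crawler_from_user_agent($_SERVER['HTTP_USER_AGENT'] ?? '');
```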

Chris Bartow
A: 

The best way to do it with well-known, well-behaved robots, like those you mentioned, is by user agent, which you can find in $_SERVER['HTTP_USER_AGENT'].

J. Pablo Fernández
+1  A: 

I don't think crawlers come from google.com, and I know of other people you don't want to treat as bots who do come from there: everyone who searches for your site.

What you need to do is take a look at the IP of the different bots. http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=80553

The real napster
+5  A: 

First of all, I hope you're not trying to do this in order to serve search engine bots different content than your site contains for normal users. If they discover you doing this, your site can get removed from their listings entirely. So long as you understand the risks of it, you can usually find information about what unique user-agent they will use:

  • Verifying Googlebot (use user-agent, reverse DNS if you want to be sure)
  • Yahoo's user agent will contain "Slurp"

However, some people writing (usually poorly-behaved) web scrapers will set their User Agent strings to be the same as "legitimate" crawlers such as Google's. You can catch these by doing lookups on the bot's IP address/hostname to ensure that they actually are coming from Google/Yahoo/etc. Some more info about what to look for in hostname lookups (from this article):

  • Google crawlers will end with googlebot.com like in crawl-66-249-70-244.googlebot.com.
  • Yahoo crawlers will end with crawl.yahoo.net like in llf520064.crawl.yahoo.net.
  • Live Search crawlers will end with search.msn.com like in msnbot-65-55-104-161.search.msn.com.
  • Ask crawlers will end with ask.com like in crawler4037.ask.com.
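
The hostname check described above can be sketched roughly as follows: reverse-resolve the IP, compare the hostname against the suffixes in the list, then resolve the hostname forward again and confirm it maps back to the same IP. The helper names (`host_ends_with`, `verified_crawler`) and the injectable resolver parameters are assumptions made for this example. Note that gethostbynamel() only returns IPv4 addresses, so this sketch does not cover IPv6 clients.

```php
<?php
// Hostname suffixes quoted in the list above, mapped to illustrative labels.
const CRAWLER_HOST_SUFFIXES = [
    'googlebot.com'   => 'google',
    'crawl.yahoo.net' => 'yahoo',
    'search.msn.com'  => 'live',
    'ask.com'         => 'ask',
];

// Hypothetical helper: true if $host equals $suffix or is a subdomain of it.
// Matching on ".suffix" stops names like "evilgooglebot.com" from passing.
function host_ends_with(string $host, string $suffix): bool
{
    $host = strtolower(rtrim($host, '.'));
    return $host === $suffix
        || substr($host, -strlen($suffix) - 1) === '.' . $suffix;
}

// Hypothetical helper: returns the crawler label if $ip passes both the
// reverse and the forward DNS check, otherwise null.
function verified_crawler(string $ip, ?callable $reverse = null, ?callable $forward = null): ?string
{
    $reverse = $reverse ?? 'gethostbyaddr';   // IP -> hostname
    $forward = $forward ?? 'gethostbynamel';  // hostname -> list of IPv4 addresses

    $host = $reverse($ip);
    if ($host === false || $host === $ip) {
        return null; // lookup failed or input was malformed
    }
    foreach (CRAWLER_HOST_SUFFIXES as $suffix => $name) {
        if (host_ends_with($host, $suffix)) {
            // Forward-confirm: the claimed hostname must resolve back to the same IP.
            $ips = $forward($host);
            return (is_array($ips) && in_array($ip, $ips, true)) ? $name : null;
        }
    }
    return null;
}
```

Usage would be something like `verified_crawler($_SERVER['REMOTE_ADDR'])`. Both lookups hit DNS, so caching the result per IP is probably worthwhile on a busy site.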
Chad Birch
Microsoft's Live Search routinely does "QA" crawls with spoofed browser user agents. They even set the referrer to make it look like they've just come from a SERP. (More here: http://ekstreme.com/thingsofsorts/blogging/yell-if-microsofts-livecom-spammed-you-too.) In short, even the major search engines will spoof headers when it suits them.
Frank Farmer
A: 
Silfverstrom
A: 
$_SERVER['HTTP_USER_AGENT']

Check various user agent strings here: http://www.user-agents.org/

NinethSense
A: 

Check this informative article on Google Crawl Intimation. It also appears to comply with Google's suggestion for identifying its Googlebot.

PradeepKr