I would like to automatically detect Google and other crawlers and log them into my ASP.NET website. Has anyone found a reliable way to do this? The login part is easy; reliably detecting the crawlers is the real issue.

Regards.

+3  A: 

This seems like a really bad idea for several reasons, not the least of which is that Google will cache copies of your pages. Even if I do not authenticate against your site, I will be able to see the content of web pages and other documents served from behind the protected portion of your web site.

As far as detecting web crawlers goes, I wouldn't trust any User Agent. You could probably compile a list of IP addresses the crawlers originate from, but as soon as Google adds another IP address, you will be denying that crawler access.

Doing a reverse DNS lookup on every request to ensure the domain of the visitor is googlebot.com, as suggested at Verifying Googlebot, could be a big performance hit if your site is busy.

Grant Wagner
+1  A: 

Look at the user-agent. It should contain Googlebot. A more reliable way is to do a reverse lookup of the IP address, which will tell you whether the visitor is Googlebot or not. I'd use both methods. BUT NOTE: You will slow your site, since you will be doing a reverse lookup for every visitor.

http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=80553
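
To make the "use both methods" idea concrete, here is a rough sketch in Python (the same logic should port to ASP.NET). It checks the user-agent first, then does a reverse lookup and confirms it with a forward lookup. The function name, the injectable `reverse_lookup`/`forward_lookup` parameters, and the accepted domain suffixes are my own assumptions for illustration, so check them against the Google page linked above before relying on them.

```python
import socket

# Assumed suffixes for Google's crawler hosts; verify against Google's own docs.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, user_agent, reverse_lookup=None, forward_lookup=None):
    """Return True only if the user-agent and both DNS checks agree.

    reverse_lookup(ip) -> hostname and forward_lookup(hostname) -> list of IPs
    are hypothetical hooks so the DNS calls can be swapped out (e.g. in tests).
    """
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        # gethostbyname_ex is IPv4-only; an IPv6 site would need getaddrinfo.
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]

    # Cheap check first: skip DNS entirely for visitors not claiming to be Googlebot.
    if "googlebot" not in user_agent.lower():
        return False

    try:
        host = reverse_lookup(ip).rstrip(".").lower()
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        # The forward lookup must map back to the same IP, otherwise the
        # reverse record could have been faked by whoever controls that IP.
        return ip in forward_lookup(host)
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        return False
```

Because the user-agent test runs first, ordinary visitors never trigger a DNS lookup; only requests claiming to be Googlebot pay that cost.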

Byron Whitlock
Both answers are good, however you came in first.
Nissan Fan
thanks, there sure are a lot of smart peeps on stackoverflow!
Byron Whitlock
A: 

You don't need to do a reverse lookup on every request. Cache the results of the reverse lookup. I just went through my logs, and I see long runs of Googlebot coming from the same IP. That behavior isn't guaranteed, but in any event caching should be a sound strategy.
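
A minimal sketch of that caching idea, in Python for brevity: remember the verdict per IP for a while, and only run the expensive lookup on a miss or after the entry expires. The class name `VerdictCache`, the one-hour default TTL, and the `verify` callback are all hypothetical choices of mine, not anything prescribed by Google.

```python
import time


class VerdictCache:
    """Caches the result of an expensive per-IP check for ttl_seconds."""

    def __init__(self, verify, ttl_seconds=3600, clock=time.monotonic):
        self._verify = verify      # callable: ip -> bool (e.g. a reverse DNS check)
        self._ttl = ttl_seconds    # assumed default; tune to your traffic
        self._clock = clock        # injectable so expiry can be tested
        self._entries = {}         # ip -> (verdict, expires_at)

    def check(self, ip):
        now = self._clock()
        hit = self._entries.get(ip)
        if hit is not None and hit[1] > now:
            return hit[0]          # still fresh: no lookup needed
        verdict = self._verify(ip)
        self._entries[ip] = (verdict, now + self._ttl)
        return verdict
```

Negative verdicts are cached too, so a non-Google IP isn't looked up on every request either. Note that this dictionary grows without bound; on a busy site you would probably want a size limit or the framework's own cache instead.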

davidjbullock
A: 

You can easily direct Google to index, but not cache, your site's pages using the noarchive meta tag. That way you gain the benefits of being searchable without exposing your content through cached copies.

See this page at Google Webmaster Central for more information:

http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=156412
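
For reference, the tag itself is `<meta name="robots" content="noarchive">` placed in the page's head. If you'd rather add it in code than edit every page, something along these lines would do it (a Python sketch; the helper name `add_noarchive` is made up, and it only handles a plain `<head>` tag without attributes):

```python
NOARCHIVE_META = '<meta name="robots" content="noarchive">'


def add_noarchive(html):
    """Insert the noarchive robots meta tag right after the opening <head> tag.

    Returns the input unchanged if no bare <head> tag is found.
    """
    marker = "<head>"
    idx = html.lower().find(marker)
    if idx == -1:
        return html
    insert_at = idx + len(marker)
    return html[:insert_at] + NOARCHIVE_META + html[insert_at:]
```

In ASP.NET you would more likely just put the tag in your master page, which gets the same result with no code.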

davidjbullock