Google's Webmaster Guidelines state:

Allow search bots to crawl your sites without session IDs or arguments that track their path through the site. These techniques are useful for tracking individual user behavior, but the access pattern of bots is entirely different. Using these techniques may result in incomplete indexing of your site, as bots may not be able to eliminate URLs that look different but actually point to the same page.

My ASP.NET 1.1 site uses custom authentication/authorization and relies pretty heavily on session GUIDs (similar to this approach). I'm worried that allowing non-session-tracked traffic will either break my existing code or introduce security vulnerabilities.

What best practices are there for allowing non-session-tracked bots to crawl a normally session-tracked site? And are there any ways of detecting search bots other than inspecting the user agent (I don't want people to spoof themselves as Googlebot to get around my session tracking)?

+2  A: 

The correct way to detect bots is by host entry (Dns.GetHostEntry). Some lame robots require you to track by IP address, but the popular ones generally don't. Googlebot requests come from *.googlebot.com. After you get the host entry, you should check the IPHostEntry.AddressList to make sure it contains the original IP address.

Do not even look at the user agent when verifying robots.

See also http://googlewebmastercentral.blogspot.com/2006/09/how-to-verify-googlebot.html
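
A minimal C# sketch of that reverse-then-forward check (the helper name and the accepted host suffixes are illustrative, and results should be cached per IP as discussed in the comments below):

```csharp
using System;
using System.Net;
using System.Net.Sockets;

// Sketch of the reverse-then-forward DNS verification described above.
public static class BotVerifier
{
    public static bool IsVerifiedGooglebot(IPAddress remoteAddress)
    {
        try
        {
            // Reverse lookup: IP -> host name (e.g. crawl-66-249-66-1.googlebot.com)
            string host = Dns.GetHostEntry(remoteAddress).HostName.ToLowerInvariant();

            if (!host.EndsWith(".googlebot.com") && !host.EndsWith(".google.com"))
                return false;

            // Forward lookup: host name -> addresses; the original IP must be in the
            // list, otherwise the PTR record could simply be lying.
            IPHostEntry forward = Dns.GetHostEntry(host);
            foreach (IPAddress address in forward.AddressList)
            {
                if (address.Equals(remoteAddress))
                    return true;
            }
        }
        catch (SocketException)
        {
            // No reverse record or lookup failure: treat as unverified.
        }
        return false;
    }
}
```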

Brian
Nice link. Doing DNS lookups on every web request sounds expensive though. Will have to try it I guess.
qntmfred
@qntmfred: You can cache those results. In a pinch, you can reserve lookups for cases where the user agent is from a search engine or where session state is missing. That being said, you're probably only going to be paying 100ms or so per lookup, and that's not CPU time, so it will only be paid by the user who requires the lookup. Also, you are only doing it on web requests for pages. Do *NOT* do it on, for example, web requests for CSS files.
Brian
99% of my requests are not going to be bots, so caching wouldn't help much. But only doing the DNS lookup in cases where the user agent indicates a search engine should do the trick.
qntmfred
@qntmfred: Some search engines use page-loading speed to influence results. You might still want to use caching.
Brian
Oh, I see. For the user agents that say they're a search engine, do the DNS lookup and cache that IP, rather than caching the IP/DNS for every single request. Right.
qntmfred
As noted in my answer, there is no need to strictly verify Googlebot. Stripping session IDs is nothing worth hiding from Googlebot impersonators. That's something completely different from allowing access to protected content, as required by Google's First Click Free, for instance.
sfussenegger
Considering "*.googlebot.com" as DNS from Google is completely unsafe. It is possible for you to create MyOwnDNS.googlebot.com.
Gladwin Burboz
@Gladwin: That's news to me.
Brian
+1  A: 

First of all: we had some issues with simply stripping JSESSIONIDs from responses to known search engines. Most notably, creating a new session for each request caused OutOfMemoryErrors (while you're not using Java, keeping state for thousands of active sessions is certainly a problem for most or all servers/frameworks). This might be solved by reducing the session timeout (for bot sessions only, if possible). So if you'd like to go down this path, be warned. And if you do, there is no need to do DNS lookups. You aren't protecting anything valuable here (compared to Google's First Click Free, for instance). If somebody pretends to be a bot, that should normally be fine.
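
A rough Global.asax sketch of that timeout mitigation in ASP.NET, assuming a hypothetical IsKnownBot check and an arbitrary two-minute timeout:

```csharp
using System;
using System.Web;

// Global.asax.cs sketch: shorten sessions for requests that look like crawlers
// so thousands of single-request bot sessions don't sit in memory for the
// full default timeout. IsKnownBot is a placeholder, not a real API.
public class Global : HttpApplication
{
    protected void Session_Start(object sender, EventArgs e)
    {
        if (IsKnownBot(Request))
        {
            Session.Timeout = 2; // minutes; pick whatever is safe for your app
        }
    }

    private static bool IsKnownBot(HttpRequest request)
    {
        // Placeholder check; a real one would use the DNS verification above.
        string userAgent = request.UserAgent == null ? string.Empty : request.UserAgent;
        return userAgent.IndexOf("Googlebot", StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```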

Instead, I'd suggest keeping session tracking (using URL parameters as a fallback for cookies) and adding a canonical link tag (<link rel="canonical" href="..." />, obviously without the session ID itself) to each page. See "Make Google Ignore JSESSIONID" or an extensive video featuring Matt Cutts for discussion. Adding this tag isn't very intrusive and could possibly be considered good practice anyway. So basically you would end up without any dedicated handling of search engine spiders - which certainly is a Good Thing (tm).
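
A small sketch of building such a canonical href by dropping the session parameter from the current URL (the parameter name "sid" and the helper are assumptions; the tag itself still has to be emitted in the page head):

```csharp
using System;
using System.Collections.Generic;
using System.Web;

// Sketch: rebuild the current request's URL without the session-tracking
// query parameter, for use as the canonical href.
public static class CanonicalUrl
{
    public static string ForRequest(HttpRequest request, string sessionParam)
    {
        // Scheme + host + path, without the query string
        string baseUrl = request.Url.GetLeftPart(UriPartial.Path);

        List<string> kept = new List<string>();
        foreach (string key in request.QueryString.AllKeys)
        {
            if (key == null || key.Equals(sessionParam, StringComparison.OrdinalIgnoreCase))
                continue;
            kept.Add(HttpUtility.UrlEncode(key) + "=" +
                     HttpUtility.UrlEncode(request.QueryString[key]));
        }

        return kept.Count == 0 ? baseUrl : baseUrl + "?" + string.Join("&", kept.ToArray());
    }
}

// In the page head (Web Forms), something like:
//   <link rel="canonical" href="<%= CanonicalUrl.ForRequest(Request, "sid") %>" />
```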

sfussenegger
A: 

I believe your approach to the problem is not quite right. You shouldn't rely on the session tracking mechanism to decide on access rights, to log malicious users, to detect bots, etc.

  1. If you don't want arbitrary users to access certain pages, you should use authentication and authorization. If arbitrary users are allowed to access the page at all, they should be allowed to do it without any session ID (as if it is the first page they visit) - so bots will also be able to crawl these pages without any problems.

  2. Malicious users could most likely circumvent your session tracking by not using (or tweaking) cookies, referrers, URL parameters, etc. So session tracking cannot be reliably used here; just do plain logging of every request with its originating IP. Later you can analyze the collected data to detect suspicious activity, try to find users with multiple IPs, and so on. This analysis is complex and should not be done at runtime.

  3. To detect bots, you can do a reverse DNS lookup on the collected IPs (see the sketch after this list). Again, this can be done offline, so there is no performance penalty. Generally, the content of the page served should not depend on whether the visitor is a bot or an unauthenticated human user (search engines rightfully treat such behaviour as cheating).
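
A minimal offline sketch of that kind of analysis, assuming a flat log of tab-separated "IP<TAB>URL" lines and an illustrative list of crawler host suffixes:

```csharp
using System;
using System.IO;
using System.Net;
using System.Net.Sockets;

// Offline sketch: reverse-resolve the IPs from an access log and report
// which ones look like well-known crawlers. The log format and the suffix
// list are assumptions for illustration.
public static class LogBotReport
{
    public static void Run(string logPath)
    {
        string[] crawlerSuffixes = { ".googlebot.com", ".search.msn.com", ".crawl.yahoo.net" };

        foreach (string line in File.ReadAllLines(logPath))
        {
            string ip = line.Split('\t')[0];
            string host;
            try
            {
                host = Dns.GetHostEntry(ip).HostName.ToLowerInvariant();
            }
            catch (SocketException)
            {
                continue; // no reverse record; skip this entry
            }

            foreach (string suffix in crawlerSuffixes)
            {
                if (host.EndsWith(suffix))
                {
                    Console.WriteLine("{0} -> {1} (likely crawler)", ip, host);
                    break;
                }
            }
        }
    }
}
```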

VladV
A: 

If spoofing is your main concern, you're doing security wrong. You shouldn't give robots any more permissions than users - quite the opposite (hence users get logins and robots get robots.txt).

If you're going to give someone special privileges without authentication, it is inherently open to spoofing. IPs can be spoofed. Server-client communication can be spoofed. And so on.

If you rely on tracking cookies to analyse malicious behaviour, you need to fix that. It should be easy enough to get a good understanding without requesting that the malicious user identify him/herself.

IPs aren't a good substitute for authentication, but they are good enough for grouping if cookies aren't available. Besides, you should be using more reliable means (i.e. a combination of factors) in the first place.

Alan
I don't give robots more permissions; they have the same permissions as any other anonymous user. However, I track anonymous users' page visits via their session, and I don't want to issue sessions to robots or track their visits.
qntmfred
You considered the possibility of malicious users not being tracked with cookies as a security concern. Therefore this special treatment for bots seems to be a privilege. It shouldn't be a security problem in the first place, that's all I'm saying.
Alan