I'm trying to create a controller for my sitemap, but only allow search engines to view it.

If you look at http://stackoverflow.com/robots.txt you'll see that their sitemap is http://stackoverflow.com/sitemap.xml. If you try to visit the sitemap, you'll be redirected to the 404 page.

This meta question confirms this behavior (answered by Jeff himself).

Now I don't want this question closed as "belongs on Meta", as I'm just using StackOverflow as an example. What I really need answered is...

How can I block all visitors to a controller EXCEPT for search bots?

+4  A: 

You can probably create a filter attribute that rejects the request based on the User-Agent header. The usefulness of this is questionable (and it is not a security feature), as the header can easily be faked, but it will stop people hitting the action from a stock browser.

This page contains a list of the user agent strings that Googlebot uses.

Sample code to redirect non-Googlebot requests to a 404 action on an error controller:

using System;
using System.Web.Mvc;
using System.Web.Routing;

[AttributeUsage(AttributeTargets.Method, AllowMultiple = false)]
public class BotRestrictAttribute : ActionFilterAttribute {

    public override void OnActionExecuting(ActionExecutingContext c) {
      // Exact match against one published Googlebot user agent string (see the list linked above).
      if (c.HttpContext.Request.UserAgent != "Googlebot/2.1 (+http://www.googlebot.com/bot.html)") {
        c.Result = new RedirectToRouteResult(new RouteValueDictionary(new { action = "NotFound", controller = "Error" }));
      }
    }
}
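
To apply it, usage would look roughly like this (SitemapController, Index, and GetSitemapXml are placeholder names for illustration, not part of the original answer):

public class SitemapController : Controller {

    // Only requests presenting the Googlebot user agent get past the filter.
    [BotRestrict]
    public ActionResult Index() {
        // GetSitemapXml() stands in for whatever builds your sitemap XML.
        return Content(GetSitemapXml(), "application/xml");
    }
}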

EDIT To respond to the comments: if server load is a concern for your sitemap, restricting access to bots might not be sufficient. Googlebot by itself can grind your server to a halt if it decides to crawl aggressively. You should probably cache the response as well; you can use the same filter attribute and the ASP.NET cache for that.

Here is a very rough example; it might need tweaking with the proper HTTP headers:

using System;
using System.Web.Mvc;
using System.Web.Routing;

[AttributeUsage(AttributeTargets.Method, AllowMultiple = false)]
public class BotRestrictAttribute : ActionFilterAttribute {

    public const string SitemapKey = "sitemap";

    public override void OnActionExecuting(ActionExecutingContext c) {
      if (c.HttpContext.Request.UserAgent != "Googlebot/2.1 (+http://www.googlebot.com/bot.html)") {
        c.Result = new RedirectToRouteResult(new RouteValueDictionary(new { action = "NotFound", controller = "Error" }));
        return;
      }

      // Serve the cached copy if the sitemap action has already generated one.
      var sitemap = c.HttpContext.Cache[SitemapKey] as string;
      if (sitemap != null) {
        c.Result = new ContentResult { Content = sitemap, ContentType = "application/xml" };
      }
    }
}

//In the sitemap action method (requires using System.Web.Caching;)
string sitemapString = GetSitemap();
HttpContext.Cache.Add(
  BotRestrictAttribute.SitemapKey,  //cache key
  sitemapString,                    //data
  null,                             //no dependencies
  DateTime.Now.AddMinutes(1),       //absolute expiration: one minute from now
  Cache.NoSlidingExpiration,        //no sliding expiration
  CacheItemPriority.Low,            //evict early under memory pressure
  null                              //no removal callback
);
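
With this in place, the first Googlebot request generates the sitemap and caches it, and any further bot requests within the next minute are answered directly from the filter without touching the database.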
Igor Zevaka
I have read that we are to absolutely **not** use the UserAgent because of spoofing. I suppose it doesn't matter that much though, since the content isn't sensitive. Hmmm.
rockinthesixstring
I am not sure what else you could use to identify bots. Bot IP addresses? Some sort of heuristic analyzing visit patterns? It gets tricky very fast. What's so wrong with making the sitemap available to users anyway?
Igor Zevaka
Your method is most probably the method I will use (as I said before, it's not a security need), just a way to prevent too much DB load from legit users. This being said however, I did read an article on [Detecting GoogleBot using Reverse DNS](http://googlewebmastercentral.blogspot.com/2006/09/how-to-verify-googlebot.html).
rockinthesixstring
See my answer below for my variation on this. Works like a champ!!! http://stackoverflow.com/questions/3544043/asp-net-mvc-block-all-visitors-to-a-specific-controller-except-search-bots-goo/3544662#3544662
rockinthesixstring
@Igor - you asked "What's so wrong in making the sitemap available to users" - I suppose it's twofold: 1) it's a lot of DB load when my sitemap will have tens of thousands of records (similar to SO), and 2) I don't really want other web applications crawling through my site (taking content).
rockinthesixstring
Thanks for the edit Igor. I think I'd rather implement separate caching on that controller. I have another FilterAttribute that caches controllers.
rockinthesixstring
A: 

Another thing you can use is a DNS lookup, which is explained here: Verifying Googlebot.

You can add a reverse DNS lookup in your ViewEngine.
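
A rough C# sketch of that check (class and method names are mine; the reverse-then-forward lookup is the approach Google describes, and in practice you would cache the result per IP since DNS lookups are slow):

using System;
using System.Linq;
using System.Net;

public static class GooglebotVerifier {

    // Reverse-resolve the caller's IP, check the host name belongs to
    // googlebot.com or google.com, then forward-resolve that host and
    // confirm it maps back to the original IP.
    public static bool IsGooglebot(string ipAddress) {
        try {
            string host = Dns.GetHostEntry(ipAddress).HostName;

            if (!host.EndsWith(".googlebot.com", StringComparison.OrdinalIgnoreCase) &&
                !host.EndsWith(".google.com", StringComparison.OrdinalIgnoreCase)) {
                return false;
            }

            // The forward lookup must include the original address.
            return Dns.GetHostEntry(host).AddressList.Any(a => a.ToString() == ipAddress);
        }
        catch (Exception) {
            // Treat lookup failures as "not verified".
            return false;
        }
    }
}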

rob waminal
+3  A: 

I'm using Igor's solution with a bit of a twist.

First, I've got the following Browser file

SearchBot.browser

<browsers>
    <browser id="Slurp" parentID="Mozilla">
        <identification>
            <userAgent match="Slurp" />
        </identification>
        <capabilities>
            <capability name="crawler" value="true" />
        </capabilities>
    </browser>
    <browser id="Yahoo" parentID="Mozilla">
        <identification>
            <userAgent match="http\:\/\/help.yahoo.com\/help\/us\/ysearch\/slurp" />
        </identification>
        <capabilities>
            <capability name="crawler" value="true" />
        </capabilities>
    </browser>
    <browser id="Googlebot" parentID="Mozilla">
        <identification>
            <userAgent match="Googlebot" />
        </identification>
        <capabilities>
            <capability name="crawler" value="true" />
        </capabilities>
    </browser>
    <browser id="msnbot" parentID="Mozilla">
        <identification>
            <userAgent match="msnbot" />
        </identification>
        <capabilities>
            <capability name="crawler" value="true" />
        </capabilities>
    </browser>
</browsers>
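
(For ASP.NET to pick this file up, it normally goes in the site's App_Browsers folder.)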

Then an ActionFilterAttribute

Imports System.Web.Mvc
Imports System.Net
Imports System.Web

Namespace Filters
    <AttributeUsage(AttributeTargets.Method, AllowMultiple:=False)> _
    Public Class SearchBotFilter : Inherits ActionFilterAttribute

        Public Overrides Sub OnActionExecuting(ByVal c As ActionExecutingContext)
            If Not HttpContext.Current.Request.Browser.Crawler Then
                HttpContext.Current.Response.StatusCode = CInt(HttpStatusCode.NotFound)
                c.Result = New ViewResult() With {.ViewName = "NotFound"}
            End If
        End Sub
    End Class
End Namespace

And finally my Controller

    <SearchBotFilter()> _
    Function Index() As ActionResult
        Return View()
    End Function

Thanks Igor, it's a great solution.

rockinthesixstring
Looks good, good use of the browser file.
Igor Zevaka