I'm trying to create a controller for my sitemap, but only allow search engines to view it.

If you look at http://stackoverflow.com/robots.txt you'll see that their sitemap is http://stackoverflow.com/sitemap.xml. If you try to visit the sitemap, you'll be redirected to the 404 page.

This meta question confirms this behavior (answered by Jeff himself).

Now I don't want this question closed as "belongs on Meta", as I'm just using StackOverflow as an example. What I really need answered is...

How can I block all visitors to a controller EXCEPT for search bots?

+4  A: 

You can probably create a filter attribute that rejects the request based on the User-Agent header. The usefulness of this is questionable (and it is not a security feature), as the header can easily be faked, but it will stop people hitting the action from a stock browser.

This page contains a list of the user agent strings that Googlebot uses.

Sample code to redirect non-Googlebot requests to a 404 action on an error controller:

using System;
using System.Web.Mvc;
using System.Web.Routing;

[AttributeUsage(AttributeTargets.Method, AllowMultiple = false)]
public class BotRestrictAttribute : ActionFilterAttribute {

    public override void OnActionExecuting(ActionExecutingContext c) {
      // Exact match against one published Googlebot user agent string (see the list linked above).
      if (c.HttpContext.Request.UserAgent != "Googlebot/2.1 (+http://www.googlebot.com/bot.html)") {
        c.Result = new RedirectToRouteResult(new RouteValueDictionary(new { action = "NotFound", controller = "Error" }));
      }
    }
}
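
To apply it, usage would look roughly like this (SitemapController, Index, and GetSitemapXml are placeholder names for illustration, not part of the original answer):

public class SitemapController : Controller {

    // Only requests presenting the Googlebot user agent get past the filter.
    [BotRestrict]
    public ActionResult Index() {
        // GetSitemapXml() stands in for whatever builds your sitemap XML.
        return Content(GetSitemapXml(), "application/xml");
    }
}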

EDIT To respond to the comments: if server load is a concern for your sitemap, restricting access to bots might not be sufficient. Googlebot by itself can grind your server to a halt if it decides to crawl aggressively. You should probably cache the response as well; you can use the same filter attribute and the ASP.NET cache for that.

Here is a very rough example; it might need tweaking with the proper HTTP headers:

using System;
using System.Web.Mvc;
using System.Web.Routing;

[AttributeUsage(AttributeTargets.Method, AllowMultiple = false)]
public class BotRestrictAttribute : ActionFilterAttribute {

    public const string SitemapKey = "sitemap";

    public override void OnActionExecuting(ActionExecutingContext c) {
      if (c.HttpContext.Request.UserAgent != "Googlebot/2.1 (+http://www.googlebot.com/bot.html)") {
        c.Result = new RedirectToRouteResult(new RouteValueDictionary(new { action = "NotFound", controller = "Error" }));
        return;
      }

      // Serve the cached copy if the sitemap action has already generated one.
      var sitemap = c.HttpContext.Cache[SitemapKey] as string;
      if (sitemap != null) {
        c.Result = new ContentResult { Content = sitemap, ContentType = "application/xml" };
      }
    }
}

//In the sitemap action method (requires using System.Web.Caching;)
string sitemapString = GetSitemap();
HttpContext.Cache.Add(
  BotRestrictAttribute.SitemapKey,  //cache key
  sitemapString,                    //data
  null,                             //no dependencies
  DateTime.Now.AddMinutes(1),       //absolute expiration: one minute from now
  Cache.NoSlidingExpiration,        //no sliding expiration
  CacheItemPriority.Low,            //evict early under memory pressure
  null                              //no removal callback
);
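
With this in place, the first Googlebot request generates the sitemap and caches it, and any further bot requests within the next minute are answered directly from the filter without touching the database.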
Igor Zevaka
I have read that we are to absolutely **not** use the UserAgent because of spoofing. I suppose it doesn't matter that much though, since the content isn't sensitive. Hmmm.
rockinthesixstring
I am not sure what else you could use to identify bots. Bot IP addresses? Some sort of heuristic analyzing visit patterns? It gets tricky very fast. What's so wrong with making the sitemap available to users anyway?
Igor Zevaka
Your method is most probably the method I will use (as I said before, it's not a security need), just a way to prevent too much DB load from legit users. This being said however, I did read an article on [Detecting GoogleBot using Reverse DNS](http://googlewebmastercentral.blogspot.com/2006/09/how-to-verify-googlebot.html).
rockinthesixstring
See my answer below for my variation on this. Works like a champ!!! http://stackoverflow.com/questions/3544043/asp-net-mvc-block-all-visitors-to-a-specific-controller-except-search-bots-goo/3544662#3544662
rockinthesixstring
@Igor - you asked "What's so wrong in making the sitemap available to users" - I suppose it's twofold: 1) it's a lot of DB load when my sitemap will have tens of thousands of records (similar to SO), and 2) I don't really want other web applications crawling through my site (taking content).
rockinthesixstring
Thanks for the edit Igor. I think I'd rather implement separate caching on that controller. I have another FilterAttribute that caches controllers.
rockinthesixstring
A: 

Another thing you can use is a DNS lookup, which is explained here: Verifying Googlebot.

You can add a reverse DNS lookup in your ViewEngine.
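
A rough C# sketch of that check (class and method names are mine; the reverse-then-forward lookup is the approach Google describes, and in practice you would cache the result per IP since DNS lookups are slow):

using System;
using System.Linq;
using System.Net;

public static class GooglebotVerifier {

    // Reverse-resolve the caller's IP, check the host name belongs to
    // googlebot.com or google.com, then forward-resolve that host and
    // confirm it maps back to the original IP.
    public static bool IsGooglebot(string ipAddress) {
        try {
            string host = Dns.GetHostEntry(ipAddress).HostName;

            if (!host.EndsWith(".googlebot.com", StringComparison.OrdinalIgnoreCase) &&
                !host.EndsWith(".google.com", StringComparison.OrdinalIgnoreCase)) {
                return false;
            }

            // The forward lookup must include the original address.
            return Dns.GetHostEntry(host).AddressList.Any(a => a.ToString() == ipAddress);
        }
        catch (Exception) {
            // Treat lookup failures as "not verified".
            return false;
        }
    }
}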

rob waminal
+3  A: 

I'm using Igor's solution with a bit of a twist.

First, I've got the following Browser file

SearchBot.browser

<browsers>
    <browser id="Slurp" parentID="Mozilla">
        <identification>
            <userAgent match="Slurp" />
        </identification>
        <capabilities>
            <capability name="crawler" value="true" />
        </capabilities>
    </browser>
    <browser id="Yahoo" parentID="Mozilla">
        <identification>
            <userAgent match="http\:\/\/help.yahoo.com\/help\/us\/ysearch\/slurp" />
        </identification>
        <capabilities>
            <capability name="crawler" value="true" />
        </capabilities>
    </browser>
    <browser id="Googlebot" parentID="Mozilla">
        <identification>
            <userAgent match="Googlebot" />
        </identification>
        <capabilities>
            <capability name="crawler" value="true" />
        </capabilities>
    </browser>
    <browser id="msnbot" parentID="Mozilla">
        <identification>
            <userAgent match="msnbot" />
        </identification>
        <capabilities>
            <capability name="crawler" value="true" />
        </capabilities>
    </browser>
</browsers>
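
(For ASP.NET to pick this file up, it normally goes in the site's App_Browsers folder.)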

Then an ActionFilterAttribute

Imports System.Web.Mvc
Imports System.Net
Imports System.Web

Namespace Filters
    <AttributeUsage(AttributeTargets.Method, AllowMultiple:=False)> _
    Public Class SearchBotFilter : Inherits ActionFilterAttribute

        Public Overrides Sub OnActionExecuting(ByVal c As ActionExecutingContext)
            If Not HttpContext.Current.Request.Browser.Crawler Then
                HttpContext.Current.Response.StatusCode = CInt(HttpStatusCode.NotFound)
                c.Result = New ViewResult() With {.ViewName = "NotFound"}
            End If
        End Sub
    End Class
End Namespace

And finally my Controller

    <SearchBotFilter()> _
    Function Index() As ActionResult
        Return View()
    End Function

Thanks Igor, it's a great solution.

rockinthesixstring
Looks good, good use of the browser file.
Igor Zevaka