views: 22
answers: 3

I am storing my sitemaps in my web folder. I want web crawlers (Googlebot etc.) to be able to access the files, but I don't necessarily want all and sundry to have access to them.

For example, this site (stackoverflow.com) has a sitemap, as specified by its robots.txt file (http://stackoverflow.com/robots.txt).

However, when you type http://stackoverflow.com/sitemap.xml, you are directed to a 404 page.

How can I implement the same thing on my website?

I am running a LAMP website, and I am using a sitemap index file (so I have multiple sitemaps for the site). I would like to use the same mechanism as described above to make them unavailable via a browser.

A: 

You can check the User-Agent header the client sends and only serve the sitemap to known search bots. However, this is not really safe, since the User-Agent header is easily spoofed.
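
On a LAMP setup, a minimal sketch of that idea with Apache's mod_rewrite might look like the following (in an .htaccess file or vhost config; the bot names are only examples, and as noted above this still filters only on the easily spoofed header):

    RewriteEngine On
    # Requests for sitemap files whose User-Agent does not claim to be a
    # known crawler get a 404 instead of the file.
    RewriteCond %{HTTP_USER_AGENT} !(Googlebot|bingbot|Slurp) [NC]
    RewriteRule ^sitemap.*\.(xml|gz)$ - [R=404,L]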

Sjoerd
A: 

Stack Overflow presumably checks two things when deciding who gets access to the sitemaps:

  • The USER_AGENT string
  • The originating IP address

Both will probably be matched against a database of known legitimate bots.

The USER_AGENT string is pretty easy to check in a server-side language; it is also very easy to fake.
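
As a rough illustration only (an assumption about how such a check could be written on a LAMP stack, not how Stack Overflow actually does it), a server-side check in PHP could combine the User-Agent test with a reverse-and-forward DNS lookup on the client IP, which is much harder to spoof than the header alone:

    <?php
    // Sketch: accept the request as genuine Googlebot only if the User-Agent
    // claims to be Googlebot AND the client IP reverse-resolves to a
    // googlebot.com host that resolves back to the same IP.
    function is_googlebot($ip, $userAgent)
    {
        if (stripos($userAgent, 'Googlebot') === false) {
            return false;
        }
        $host = gethostbyaddr($ip);                      // reverse DNS
        if (!$host || !preg_match('/\.googlebot\.com$/i', $host)) {
            return false;
        }
        return gethostbyname($host) === $ip;             // forward-confirm
    }

    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    if (!is_googlebot($_SERVER['REMOTE_ADDR'], $ua)) {
        header('HTTP/1.0 404 Not Found');
        exit;
    }
    readfile('sitemap.xml');   // serve the real file to the verified crawler

You would then rewrite requests for the sitemap to a script like this (e.g. with mod_rewrite), so the static file is never served directly.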

Pekka
A: 

First, decide which networks you want to be able to fetch your actual sitemap.

Second, configure your web server to allow requests for your sitemap file from those networks, and to serve your 404 error page for all other requests.

For nginx, you're looking to stick something like allow 10.10.10.0/24; into a location block for the sitemap file.
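
A sketch of such a block (the netblocks and the 404 page path are placeholders; substitute the crawler ranges you actually decide to trust):

    location = /sitemap.xml {
        allow 66.249.64.0/19;   # example crawler netblock only -- verify before use
        allow 10.10.10.0/24;
        deny  all;
        # nginx answers denied requests with 403; map that onto your 404 page
        error_page 403 =404 /404.html;
    }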

For Apache, you're looking to use mod_authz_host's Allow directive inside a <Files> section for the sitemap file.
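
A sketch for Apache 2.2's mod_authz_host (again, the netblocks are placeholders, and Apache 2.4 would use Require ip ... instead):

    <Files "sitemap.xml">
        Order deny,allow
        Deny from all
        Allow from 66.249.64.0/19
        Allow from 10.10.10.0/24
        # Denied requests keep a 403 status but show your 404 page's content;
        # use mod_rewrite with R=404 if you want a true 404 status code.
        ErrorDocument 403 /404.html
    </Files>

A <FilesMatch> section with a pattern such as ^sitemap.*\.(xml|gz)$ would cover a sitemap index plus gzipped parts the same way.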

sarnold
@sarnold: this is definitely the way I want to go. User-agent strings are pretty easy to fake, so this has some appeal. I know it's by no means a 'magic silver bullet', but I think it is (at least marginally) more robust than server-side logic involving user-agent strings. Could you please provide an example that will allow access to sitemap-index.xml and *.gz files in the web folder if the request is from google.com?
morpheous
@Morpheous, the trick is finding the networks -- Google crawls from googlebot.com, and who knows if they are kind enough to stick to a single netblock or if they use dozens of netblocks. I'd suggest looking through your logs and figuring out which ones you want to allow and which you want to deny.
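
One rough way to do that on a LAMP box (a sketch assuming an Apache access log in the usual combined format at this path):

    # Which client IPs are requesting the sitemap, busiest first?
    grep 'GET /sitemap' /var/log/apache2/access.log | awk '{print $1}' \
        | sort | uniq -c | sort -rn | head
    # Reverse-resolve a candidate IP; genuine Googlebot hosts resolve under
    # googlebot.com (and resolve back to the same IP).
    host 66.249.66.1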
sarnold