I need to have control over which URLs are allowed to be indexed. To do this I want to allow Google to index only URLs that are listed in my sitemap(s), and disallow Google from indexing anything else.

The easiest way to solve this would be if there is a way to configure robots.txt to disallow everything:

User-agent: *
Disallow: /

And at the same time allow every URL that is listed in:

Sitemap: sitemap1.xml
Sitemap: sitemap2.xml

Can the robots.txt be configured to do this? Or are there any other workarounds?

A: 

By signing in to http://www.google.com/webmasters/ you can submit sitemaps directly to Google's search engine.

x0n
Google will still index pages that are not in the sitemap. This is what I want to prevent. Besides (as a side note), for search engines like Baidu.com there is no place to submit your sitemap; they only find sitemaps that are listed in your robots.txt file.
Joakim
Well, then you need to auto-generate your robots.txt file from your sitemaps. There is no relation between the two technologies.
x0n
What if my robots.txt has 1,000,000 entries? Will that cause any problems?
Joakim
You'd have to ask Google about that one. I'd imagine a 30MB+ robots.txt file would probably be ignored.
x0n
+1  A: 

You will have to add an Allow entry for each element in the sitemap. This is cumbersome, but it's easy to do programmatically with something that reads in the sitemap, or, if the sitemap is itself being created programmatically, then base the robots.txt on the same code.

Note that Allow is an extension to the robots.txt protocol, and is not supported by all search engines, though it is supported by Google.
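For example, a minimal sketch of generating such a robots.txt from an existing sitemap; the file names and the trailing Sitemap URL are illustrative assumptions, not part of the answer above:

// Minimal sketch: emit one Allow line per <loc> entry in a standard sitemap.xml,
// then disallow everything else. File names and the Sitemap URL are illustrative.
using System;
using System.IO;
using System.Xml.Linq;

class RobotsFromSitemap
{
    static void Main()
    {
        XNamespace ns = "http://www.sitemaps.org/schemas/sitemap/0.9";
        XDocument sitemap = XDocument.Load("sitemap.xml");

        using (var writer = new StreamWriter("robots.txt"))
        {
            writer.WriteLine("User-agent: *");

            // One Allow entry per URL in the sitemap, reduced to its path component.
            foreach (var loc in sitemap.Descendants(ns + "loc"))
            {
                string path = new Uri(loc.Value.Trim()).AbsolutePath;
                writer.WriteLine("Allow: " + path);
            }

            // Everything not explicitly allowed above is disallowed.
            writer.WriteLine("Disallow: /");
            writer.WriteLine("Sitemap: http://www.example.com/sitemap.xml");
        }
    }
}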

Jon Hanna
I have a dynamic robots.txt file that prints its content from the database, so this is a possible solution that I have thought about. But what if this list of Allow (or Disallow) entries has 100,000 entries, or 1,000,000 for that matter? Will it cause any problems if the robots.txt file is extremely huge?
Joakim
I honestly don't know. Whether it did or not though, I'd look at structuring the URI associations so that a few Disallow statements are all you need in robots.txt. Or else just allow them to be indexed anyway (if being indexed isn't actively bad for some reason, then it's normally good, even if it's not a priority to you).
Jon Hanna
It's hard to explain this issue in such detail in a comment field like this. Here's a crash course in a few sentences: in my case we have a domain per customer, all sharing the same website. Content that belongs to domain A is separated into domain A's sitemap. But Google doesn't care about this and finds pages/content that belongs to domain B and "attaches" it to domain A. So the result of this is that in Google's search results we get hits for the same page on multiple domains. This is what we need to prevent.
Joakim
I'd aim at blocking it from being served at all in this case. If `http://domainA/pathB` shouldn't be served, and you can't just split them off into different applications, then have it send a 404 in that case. Then not only will Google not index it, but no other way of getting to it will be possible either. Something in a base class for pages, in global.asax.cs, or an HttpModule could catch these cases.
Jon Hanna
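A minimal sketch of that HttpModule idea, assuming ASP.NET and a hypothetical GetOwningDomain lookup against your own data (not code from this thread):

// Minimal sketch: answer 404 when content that belongs to one customer's domain
// is requested on another domain. GetOwningDomain is a hypothetical lookup.
using System;
using System.Web;

public class WrongDomainModule : IHttpModule
{
    public void Init(HttpApplication context)
    {
        context.BeginRequest += OnBeginRequest;
    }

    private void OnBeginRequest(object sender, EventArgs e)
    {
        var app = (HttpApplication)sender;
        string requestedHost = app.Context.Request.Url.Host;
        string owningDomain = GetOwningDomain(app.Context.Request.Url.AbsolutePath);

        // If the path belongs to a different customer's domain, return 404.
        if (owningDomain != null &&
            !string.Equals(owningDomain, requestedHost, StringComparison.OrdinalIgnoreCase))
        {
            app.Context.Response.StatusCode = 404;
            app.CompleteRequest();
        }
    }

    // Hypothetical: map a request path to the domain that owns it (e.g. from the database).
    private string GetOwningDomain(string path)
    {
        return null; // placeholder
    }

    public void Dispose() { }
}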
The content on the different domains is supposed to be available for visitors, so I can't block it completely. But I'll look into checking the user-agent and sending only Google to a 404 page if the page belongs to a different domain.
Joakim
I'd be careful about any case where you send something different to Google than to everyone else, as it can look like you're trying to game rankings.
Jon Hanna
Then I'm more or less back to square one. Any idea what these two lines in robots.txt will do? `Disallow: /` and `Sitemap: sitemap.xml`
Joakim
That'll say where the sitemap is, but also block everything. I'm afraid you're down to the massive robots.txt approach if you can't restructure the site better.
Jon Hanna
A: 

This isn't a robots.txt-related answer; it relates to the robots protocol as a whole. I used this technique extremely often in the past, and it works like a charm.

As far as I understand, your site is dynamic, so why not make use of the robots meta tag? As x0n said, a 30MB file will likely create issues both for you and the crawlers; plus, appending new lines to a 30MB file is an I/O headache. Your best bet, in my opinion anyway, is to inject into the pages you don't want indexed something like:

<META NAME="ROBOTS" CONTENT="NOINDEX" />

The page will still be crawled, but it won't be indexed. You can still submit the sitemaps through a sitemap reference in robots.txt, you don't have to worry about excluding the robotted-out pages from the sitemaps, and it's supported by all the major search engines (as far as I remember, by Baidu as well).
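A minimal sketch of how that injection could look in an ASP.NET code-behind; ShouldBeIndexed is a hypothetical check against your own data (for example, "is this URL in the current domain's sitemap?"):

// Minimal sketch: conditionally emit <meta name="robots" content="noindex" />
// from an ASP.NET page. ShouldBeIndexed is a hypothetical check.
using System;
using System.Web.UI;
using System.Web.UI.HtmlControls;

public partial class ContentPage : Page
{
    protected override void OnPreRender(EventArgs e)
    {
        base.OnPreRender(e);

        if (!ShouldBeIndexed(Request.Url))
        {
            var robotsMeta = new HtmlMeta { Name = "robots", Content = "noindex" };
            Header.Controls.Add(robotsMeta); // requires <head runat="server">
        }
    }

    // Hypothetical: decide whether this URL should appear in the index.
    private bool ShouldBeIndexed(Uri url)
    {
        return true; // placeholder
    }
}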

methode
Nice and simple. Thanks a lot; it's going to cost me a lot of hours to implement it the way I want, though, so I'd better get started :))
Joakim