views: 659
answers: 6

For some reason, when I check Google Webmaster Tools' "Analyze robots.txt" to see which URLs are blocked by our robots.txt file, it's not what I'm expecting. Here is a snippet from the beginning of our file:

Sitemap: http://[omitted]/sitemap_index.xml

User-agent: Mediapartners-Google
Disallow: /scripts

User-agent: *
Disallow: /scripts
# list of articles given by the Content group
Disallow: http://[omitted]/Living/books/book-review-not-stupid.aspx
Disallow: http://[omitted]/Living/books/book-review-running-through-roadblocks-inspirational-stories-of-twenty-courageous-athletic-warriors.aspx
Disallow: http://[omitted]/Living/sportsandrecreation/book-review-running-through-roadblocks-inspirational-stories-of-twenty-courageous-athletic-warriors.aspx

Anything in the scripts folder is correctly blocked for both Googlebot and Mediapartners-Google. I can see that the two robots are reading the correct directive, because Googlebot reports the scripts as blocked from line 7 while Mediapartners-Google reports them as blocked from line 4. And yet ANY other URL I try from the disallowed URLs under the second User-agent directive is NOT blocked!

I'm wondering if my comment or my use of absolute URLs is screwing things up...

Any insight is appreciated. Thanks.
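
For reference, a quick way to test the file outside of Webmaster Tools is Python's standard urllib.robotparser; this is just a sketch, with example.com standing in for our omitted domain:

from urllib import robotparser

# Fetch and parse the live robots.txt (example.com is a placeholder host).
rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()

# Ask whether specific paths are allowed for a given user agent.
print(rp.can_fetch("*", "/scripts/some-script.js"))
print(rp.can_fetch("*", "/Living/books/book-review-not-stupid.aspx"))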

A: 

Did you recently make this change to your robots.txt file? In my experience, Google seems to cache that stuff for a really long time.

Webjedi
+1  A: 

It's been up for at least a week, and Google says it was last downloaded 3 hours ago, so I'm sure it's recent.

You're probably better off editing the original question (typically by adding EDIT in bold at the bottom, followed by the extra information) rather than answering your own question (I realize you can't comment yet).
cletus
A: 

Check this out: Checking robots.txt

mjy
+1  A: 

It's the absolute URLs. robots.txt is only supposed to include relative URIs; the domain is inferred based on the domain that the robots.txt was accessed from.
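
To see why, here is a minimal sketch of the matching most crawlers apply: the value of each Disallow line is compared as a prefix of the request path, so a value that starts with http:// can never match. The function is illustrative, not any particular crawler's code:

def is_blocked(path, disallow_values):
    # Each non-empty Disallow value is treated as a prefix of the URL path.
    return any(value and path.startswith(value) for value in disallow_values)

# A rule written as a full URL never matches a path, so nothing is blocked:
is_blocked("/Living/books/book-review-not-stupid.aspx",
           ["http://example.com/Living/books/book-review-not-stupid.aspx"])  # False

# The same rule written as a path matches as expected:
is_blocked("/Living/books/book-review-not-stupid.aspx",
           ["/Living/books/book-review-not-stupid.aspx"])  # True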

David Zaslavsky
+8  A: 

The reason they are ignored is that you have fully qualified URLs in the robots.txt file, which the specification doesn't allow. (You should only specify paths relative to the site root, i.e. paths starting with /.) Try the following:

Sitemap: /sitemap_index.xml

User-agent: Mediapartners-Google
Disallow: /scripts

User-agent: *
Disallow: /scripts
# list of articles given by the Content group
Disallow: /Living/books/book-review-not-stupid.aspx
Disallow: /Living/books/book-review-running-through-roadblocks-inspirational-stories-of-twenty-courageous-athletic-warriors.aspx
Disallow: /Living/sportsandrecreation/book-review-running-through-roadblocks-inspirational-stories-of-twenty-courageous-athletic-warriors.aspx
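
If you want to confirm the corrected rules before waiting for Google to re-crawl, you can feed the text straight into Python's urllib.robotparser; this is a minimal sketch covering a subset of the rules, with no HTTP fetch involved:

from urllib import robotparser

rules = """\
User-agent: *
Disallow: /scripts
Disallow: /Living/books/book-review-not-stupid.aspx
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "/Living/books/book-review-not-stupid.aspx"))  # False: blocked
print(rp.can_fetch("*", "/scripts/anything.js"))                       # False: blocked
print(rp.can_fetch("*", "/Living/books/some-other-article.aspx"))      # True: allowed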

As for caching, Google fetches a fresh copy of the robots.txt file roughly once every 24 hours on average.

Andrew Moore
Is that first line correct? http://www.sitemaps.org/protocol.php#submit_robots indicates that the sitemap location should be the complete URL.
David Citron
A Sitemap line with the full URL is OK, but your Disallow lines should still be paths (starting with /), not full URLs.
Andrew Moore
+1  A: 

Yup, it was the absolute URLs. Thanks!