Some information that might help.
There is no official standards body or RFC for the robots.txt protocol. It was created by consensus in June 1994 by members of the robots mailing list ([email protected]). The parts of the site that should not be accessed are listed in a file called robots.txt in the top-level directory of the website.

The robots.txt patterns are matched by simple substring comparison, so take care to append the final '/' character to patterns that are meant to match a directory; otherwise every file whose name starts with that substring will match, not just the files in the intended directory.
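For example (the /help path is only an illustration), these two rules behave differently:

User-agent: *
Disallow: /help

disallows both /help.html and /help/index.html, whereas

User-agent: *
Disallow: /help/

disallows /help/index.html but not /help.html.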
There’s no 100% sure way to keep your pages from being found, other than not publishing them at all, of course.
See:
http://www.robotstxt.org/robotstxt.html
There is no Allow directive in the Consensus, and regex-style wildcard patterns are not part of it either.
From the Robots Consensus:
This is currently a bit awkward, as there is no "Allow" field. The easy way is to put all files to be disallowed into a separate directory, say "stuff", and leave the one file in the level above this directory:
User-agent: *
Disallow: /~joe/stuff/
Alternatively you can explicitly disallow all disallowed pages:
User-agent: *
Disallow: /~joe/junk.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html
A Possible Solution:
Use .htaccess to keep search robots out of a specific folder while also blocking bad robots.
See: http://www.askapache.com/htaccess/setenvif.html
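A minimal sketch of that approach, assuming Apache 2.2-style access control (the user-agent strings and the bad_bot variable name are placeholders, not a vetted block list):

# Flag requests from unwanted crawlers (placeholder user-agent strings)
SetEnvIfNoCase User-Agent "BadBot" bad_bot
SetEnvIfNoCase User-Agent "EvilScraper" bad_bot

# Deny anything flagged above; everyone else is allowed (Apache 2.2 syntax)
Order Allow,Deny
Allow from all
Deny from env=bad_bot

Placed in the folder's .htaccess, this blocks the flagged crawlers outright, while a plain Disallow line in robots.txt covers the well-behaved ones. On Apache 2.4 the last three lines would instead be a <RequireAll> block containing "Require all granted" and "Require not env bad_bot".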