How to disallow all dynamic URLs in robots.txt

Disallow: /?q=admin/
Disallow: /?q=aggregator/
Disallow: /?q=comment/reply/
Disallow: /?q=contact/
Disallow: /?q=logout/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/

I want to disallow everything that starts with /?q=

A: 

Setting things to not show up in your sitemap.xml will not prevent robots from crawling your page. This looks like robots.txt formatting, so perhaps that's what your question is actually about?

That said, robots.txt is not very bright, and using wildcards in it tends to work sometimes and not others.

Paul McMillan
Sorry, I gave the wrong tag; it is in fact robots.txt.
pmarreddy
@Paul: see my answer to understand why wildcards break.
Stephen C
A: 

As Paul said, a lot of robots.txt interpreters are not too bright and might not interpret wildcards in the path the way you intend.

That said, some crawlers try to skip dynamic pages on their own, worrying they might get caught in infinite loops on links with varying URLs. I am assuming you are asking this question because you face a courageous crawler who is trying hard to access those dynamic paths.

If you have issues with specific crawlers, you can try to investigate how each one works by researching its robots.txt capabilities and adding a robots.txt section specifically for it.

If you generally just want to disallow such access to your dynamic pages, you might want to rethink your robots.txt design.

More often than not, dynamic parameter handling "pages" are under a specific directory or a specific set of directories. That is why it is normally enough to simply Disallow: /cgi-bin or /app and be done with it.

In your case you seem to have mapped the root to an area that handles parameters. You might want to reverse the logic of robots.txt and say something like:

User-agent: * 
Allow: /index.html
Allow: /offices
Allow: /static 
Disallow: /

This way the Allow list overrides the Disallow list by spelling out specifically what crawlers should index. Note that not all crawlers are created equal, and you may want to refine that robots.txt later, adding a specific section for any crawler that still misbehaves.
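If you want to sanity-check how a prefix-matching parser that honors Allow would read such a file, Python's urllib.robotparser can serve as a rough stand-in for a well-behaved crawler; the paths below are only illustrative:

import urllib.robotparser

robots_txt = """\
User-agent: *
Allow: /index.html
Allow: /offices
Allow: /static
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# This parser applies the first rule whose path is a prefix of the
# requested path, so the Allow lines win before Disallow: / is reached.
print(rp.can_fetch("*", "/index.html"))         # True  - explicitly allowed
print(rp.can_fetch("*", "/static/logo.png"))    # True  - covered by the /static prefix
print(rp.can_fetch("*", "/?q=user/register/"))  # False - falls through to Disallow: /

Keep in mind this only shows what a parser that supports Allow would do; a crawler that ignores Allow (see the comment below) is left with nothing but Disallow: / and may skip the site entirely.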

Behrang Dadsetan
@Grookoo: "Allow:" is non-standard. A crawler that ignores it is not misbehaving; it is simply conforming to the robots.txt spec.
Stephen C
+5  A: 

The answer to your question is to use

Disallow: /?q=

The best (currently accessible) source on robots.txt I could find is on Wikipedia. (The supposedly definitive source is http://www.robotstxt.org, but the site is down at the moment.)

According to the Wikipedia page, the standard defines just two fields: User-agent: and Disallow:. The Disallow: field does not allow explicit wildcards, but each "disallowed" path is actually a path prefix; i.e. it matches any path that starts with the specified value.
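Under that prefix rule a single Disallow: /?q= line covers every URL listed in the question. As a rough check, Python's urllib.robotparser, which implements this prefix matching, can be queried directly; example.com is just a placeholder host:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /?q="])

# Any path beginning with /?q= matches the prefix and is refused.
print(rp.can_fetch("*", "http://example.com/?q=admin/"))     # False - blocked
print(rp.can_fetch("*", "http://example.com/?q=node/add/"))  # False - blocked
print(rp.can_fetch("*", "http://example.com/node/123"))      # True  - no /?q= prefix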

The Allow: field is a non-standard extension, and any support for explicit wildcards in Disallow would be a non-standard extension. If you use these, you have no right to expect that a (legitimate) web crawler will understand them.

This is not a matter of crawlers being "smart" or "dumb": it is all about standards compliance and interoperability. For example, any web crawler that did "smart" things with explicit wildcard characters in a "Disallow:" would be bad for (hypothetical) robots.txt files where those characters were intended to be interpreted literally.
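For instance, a strictly spec-following parser such as Python's urllib.robotparser treats * in a path as a literal character rather than a wildcard, so a rule like Disallow: /*?q= quietly matches nothing useful (again, example.com is only a placeholder):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /*?q="])

# The rule only blocks paths that literally start with "/*?q=", which
# real URLs never do, so the dynamic pages remain crawlable.
print(rp.can_fetch("*", "http://example.com/?q=admin/"))  # True - the "wildcard" never matched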

Stephen C
@Stephen: thanks for the detailed answer.
pmarreddy