tags:
views: 48
answers: 1

My URL structure is set up in two parallel forms (both lead to the same place):

  • www.mydomain.com/subname
  • www.mydomain.com/123

The trouble is that the spiders are crawling into things like:

  • www.mydomain.com/subname/default_media_function
  • www.mydomain.com/subname/map_function

Note that the name "subname" represents thousands of different pages on my site that all have that same function.

These requests are erroring out because those URLs are strictly for JSON or AJAX purposes, not actual pages. I would like to block spiders from accessing them, but how would I do that if the URL contains a variable?

Would this work in robots.txt?

Disallow: /map_function
+1  A: 

You are going to have to use Disallow: /subname/map_function

Robots look for robots.txt at the root level, and they evaluate URLs left to right with no wildcards (in the original standard).

So you will either need to consolidate all of the map_function URLs into one location and exclude that, or exclude each location individually.
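A minimal sketch of what both approaches could look like in robots.txt (the paths here are placeholders, not anything from your actual site):

```
User-agent: *
# Option 1: consolidate every AJAX endpoint under one path and exclude it once
Disallow: /ajax/

# Option 2: exclude each location individually, one line per subname
Disallow: /subname1/map_function
Disallow: /subname1/default_media_function
Disallow: /subname2/map_function
Disallow: /subname2/default_media_function
```

Comments with # are part of the original robots exclusion standard, so they are safe to include.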

Isisagate
Is it common for people to produce robots.txt files dynamically? Because that variable 'subname' represents thousands of different pages, each with the same function. I'm just worried about making a huge robots.txt.
Trip
I don't think you need to worry too much about the size. Your best bet is to find a way to put all the elements into a single directory and then exclude that directory, e.g. Disallow: /subname/restrict/ or something like that; it will save you hassle in the robots.txt. However, you can also generate it dynamically. How it's created doesn't matter to the bot.
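Generating the file dynamically is straightforward, since the bot only sees the final text. A hypothetical sketch in Python (the subname list and function names below are placeholders, not anything from the actual site):

```python
def build_robots_txt(subnames, blocked_functions):
    """Return robots.txt text with one Disallow line per subname/function pair."""
    lines = ["User-agent: *"]
    for sub in subnames:
        for func in blocked_functions:
            lines.append("Disallow: /%s/%s" % (sub, func))
    return "\n".join(lines) + "\n"

# Example: two hypothetical subnames, two blocked AJAX endpoints
print(build_robots_txt(["alpha", "beta"],
                       ["map_function", "default_media_function"]))
```

Your framework would serve the returned string at /robots.txt instead of a static file.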
Isisagate
So this is a mod_rewrite issue, I take it. You might look into using # to specify your parameters; jQuery incorporates it. I think the bots wouldn't pay attention to anchors.
Isisagate
Can I do what @Kaaviar said above? Just Disallow: /map_function/ or */map_function/?
Trip
* is a wildcard; this may work on Google or the larger engines, but not smaller ones, as it isn't part of the standard. Search engines read the robots.txt rules from left to right, so /map_function/ is the same as http://www.domainname.com/map_function/, and that won't stop it. That's why, if you can find a way to consolidate map_function into its own directory or path that is the same across the subnames, you can prevent them from indexing it more easily by simply restricting that one path.
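For what it's worth, crawlers that do support the non-standard wildcard extension (Googlebot, for example) would accept a rule like the following; smaller engines that only follow the original standard will ignore it:

```
User-agent: Googlebot
Disallow: /*/map_function
```

So a wildcard rule can serve as a supplement for the big engines, but not as a replacement for the consolidated-path approach above.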
Isisagate