views: 310 · answers: 4

I need to prevent http://example.com/startup?page=2 search pages from being indexed. I want http://example.com/startup to be indexed, but not http://example.com/startup?page=2, page=3, and so on.

Edit: I forgot to add another detail: the startup part can be anything, e.g. http://example.com/XXXXX?page

Thanks in advance.

+2  A: 

I believe something like this should work out, though I have not actually tested it.

User-Agent: *
Disallow: /startup?page=
Disallow: page=
Disallow: ?page=

From the robots.txt spec, on Disallow: "The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved."
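
To sanity-check which of those URLs a spec-compliant crawler would skip, here is a minimal sketch using Python's standard urllib.robotparser. It implements only the original prefix matching quoted above (no Google-style wildcards), and the example URLs are just the placeholders from the question.

# Check the rules above with Python's stdlib parser, which only
# does the original prefix matching (no wildcard support).
from urllib.robotparser import RobotFileParser

rules = """\
User-Agent: *
Disallow: /startup?page=
Disallow: page=
Disallow: ?page=
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

for url in ("http://example.com/startup",
            "http://example.com/startup?page=2",
            "http://example.com/XXXXX?page=2"):
    print(url, "->", "allowed" if rp.can_fetch("*", url) else "blocked")

# /startup stays allowed and /startup?page=2 is blocked by the first rule,
# but /XXXXX?page=2 stays allowed: "page=" and "?page=" are not
# leading-slash prefixes, and the original spec has no wildcards.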

meder
Thanks for the answer. I forgot to add another detail: the startup part can be random, e.g. /XXXXX?page
pmarreddy
How many subdirectories do you have? You could try /?page=, or add each subdirectory and use /subdirectory?page=
meder
I found this: "User-agent: *" followed by "Disallow: /*page=". Is this right?
pmarreddy
Try: Disallow: page= or Disallow: ?page=
meder
I added both, thanks.
pmarreddy
Can you add the comment text to the answer so that it will be helpful for other people?
pmarreddy
+1  A: 

This should do the trick:

User-agent: *
Allow: /startup
Disallow: /startup?page=*
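
The trailing wildcard and the Allow line rely on Google's extensions rather than the original spec, so not every crawler honors them. As a rough way to see which URLs a wildcard rule would cover, here is a sketch in Python; pattern_matches is a hypothetical helper, and it ignores the Allow/Disallow precedence that real crawlers apply.

import re
from urllib.parse import urlparse

def pattern_matches(pattern: str, url: str) -> bool:
    # Hypothetical helper approximating wildcard matching: '*' matches
    # any run of characters, a trailing '$' anchors the end of the URL.
    parsed = urlparse(url)
    target = parsed.path + ("?" + parsed.query if parsed.query else "")
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, target) is not None

print(pattern_matches("/startup?page=*", "http://example.com/startup?page=2"))  # True
print(pattern_matches("/startup?page=*", "http://example.com/startup"))         # False
print(pattern_matches("/*?page=", "http://example.com/XXXXX?page=2"))           # True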
Adam
Thanks for the answer. I forgot to add another detail: the startup part can be random, e.g. /XXXXX?page
pmarreddy
Does the 'Allow:' even serve a purpose? And I don't think you need the wildcard.
meder
If for whatever reason he's disallowed /, then the Allow line re-allows /startup. Better safe than sorry.
Adam
+1  A: 

You can put this on the pages you do not want indexed:

<META NAME="ROBOTS" CONTENT="NONE">

This tells robots not to index the page.

On a search page, it may be more useful to use:

<META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW">

This instructs robots to not index the current page, but still follow the links on this page, allowing them to get to the pages found in the search.
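
If the search pages come out of a template, the tag can be added conditionally, so the base page stays indexable while only paginated views get noindex. A minimal sketch, assuming a dict of query parameters is available; robots_meta is a hypothetical helper, not a library function.

def robots_meta(query_params: dict) -> str:
    # Hypothetical helper: emit a robots meta tag only for paginated
    # views such as /XXXXX?page=2; the base /XXXXX page gets no tag
    # and keeps the default index,follow behaviour.
    if query_params.get("page"):
        return '<meta name="robots" content="noindex,follow">'
    return ""

print(robots_meta({"page": "2"}))  # paginated search page -> noindex,follow
print(robots_meta({}))             # base page -> empty string, nothing emitted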

Phillip Knauss
+2  A: 
  1. Create a text file and name it: robots.txt
  2. Add user agents and disallow sections (see sample below)
  3. Place the file in the root of your site

Sample:

###############################
#My robots.txt file
#
User-agent: *
#
#list directories robots are not allowed to index 
#
Disallow: /testing/
Disallow: /staging/
Disallow: /admin/
Disallow: /assets/
Disallow: /images/
#
#
#list specific URLs robots are not allowed to index
#
Disallow: /startup?page=2
Disallow: /startup?page=3
# 
#
#End of robots.txt file
#
###############################

Here's Google's actual robots.txt file: http://www.google.com/robots.txt

You can find some good information in the Google Webmasters help topic on blocking or removing pages using a robots.txt file.
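
Since the sample lists each page number by hand, the file can also be generated. A short sketch follows, where the section paths and the page range are only illustrative assumptions; a single wildcard rule, as in the other answers, is less brittle if the crawlers you care about support it.

# Sketch: write a robots.txt that enumerates paginated URLs explicitly.
sections = ["/startup"]   # add the other /XXXXX sections here (assumed names)
max_page = 10             # highest page number to block (assumed)

lines = ["User-agent: *"]
lines += [f"Disallow: {s}?page={n}"
          for s in sections
          for n in range(2, max_page + 1)]

with open("robots.txt", "w") as fh:
    fh.write("\n".join(lines) + "\n")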

Metro Smurf
Thanks for the answer. I forgot to add another detail: the startup part can be random, e.g. /XXXXX?page
pmarreddy
Using this method you'd have to manually add every ?page=(number); according to the spec, you can leave that part off.
meder