Googlebot has occasionally been indexing one of our sites with a bad query string parameter. I am not sure where it is getting this parameter (there don't appear to be any sites linking to us with bad links, and nothing on our site inserts the bad value). The bad parameter causes the site to throw a 500 error, as we expect.

I was under the impression that Google would not index pages that return a 500 error, but it turns out that it does. So now I have two questions:

1) Why would Googlebot be inserting random bad query string values? (I don't really care about the answer to this question, but if we could do something to avoid that, it would solve our problem.)

2) Why would Google index a page that returns a 500 error?

Here is one of the erroneous links that Googlebot created and that Google has indexed:

http://www.pbs.org/teacherline/catalog/browse/?sa=4&gb=baqhuxts&gb=20&gb=21&num=20&page=2&js=0&sa=1

The bad parameter is gb=baqhuxts. The parameter 'gb' is expected to be an integer. If you remove that parameter from the query string, you get a normal catalog page.
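For illustration, one way to reject the bad value explicitly, rather than letting it bubble up as an unhandled 500, would be something like the following minimal sketch. It assumes a Flask-style Python handler, which is not necessarily what our site actually runs; the route path and placeholder response are purely illustrative:

    from flask import Flask, request, abort

    app = Flask(__name__)

    @app.route("/teacherline/catalog/browse/")
    def browse():
        # 'gb' can appear more than once in the query string (gb=20&gb=21...)
        raw_values = request.args.getlist("gb")
        try:
            gb_values = [int(v) for v in raw_values]
        except ValueError:
            # A non-integer value like gb=baqhuxts gets a clean 404
            # instead of bubbling up as an unhandled 500.
            abort(404)
        # Real catalog rendering would go here.
        return "catalog page for gb values: %r" % gb_values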

Regarding nofollow and robots.txt solutions: [ REDACTED ]

I realize now that I am a moron and put a meta tag telling search robots to index the page. That was a dumb thing to do. I'm removing those tags. :-(

If you search Google for 'baqhuxts', you will find that it has indexed 10 pages with this bad parameter, yet each of these pages returns a 500 error. Does anyone have insight into why Google believes these are valid pages to index?

+1  A: 

Unfortunately, I only know the answer to #1:

Google will crawl weird pages like that because people with the Google Toolbar visit pages that don't exist, and their browsing information is transmitted to Google. This is why you will often find indexed pages that have no business being indexed, for example, phpMyAdmin pages that aren't linked from anywhere.

mkoryak
I understand that what you describe happens, and it can indeed be tricky. But please see the updated question: I don't want to implement an across-the-board disallow on catalog URLs. Or are you saying that my 500 page should have a meta tag disallowing robot indexing?
shawnr
+3  A: 

It's probably because you are telling Google to index it by having this in your meta tags:

<meta name="robots" content="index,follow">

Try removing that! :)
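If you also want Google to drop the bad URLs it has already picked up, one option is to have the error page itself serve the opposite directive (noindex is a standard robots meta value, though how quickly Google acts on it can vary):

<meta name="robots" content="noindex">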

Dan Diplo
Thank you, Dan!
shawnr