For some mysterious reason, Google has indexed both of these addresses, which lead to the same page:

/something/some-text-1055.html

and

/index.php?pg=something&id=1055

(A quick note: the site has had friendly URLs since its launch, and I have no idea how Google found the "index.php?" URL. The "unfriendly" URLs appear only in the content management system, which is password-protected.)

What can I do to solve the situation? (I have around 1000 pages that are double-indexed.) Somebody told me to use "disallow: index.php?" in the robots.txt file. Right or wrong? Any other suggestions?

+3  A: 

If you use a sitemap generator to submit URLs to search engines, you'll want to exclude the index.php URLs there as well. That is likely where Google got your links: such generators work by crawling your folders and checking your logs.

Nerdling
+3  A: 

Better: check which URI was requested ($_SERVER['REQUEST_URI']) and redirect if it was /index.php.
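A minimal sketch of that check, written in Python rather than PHP purely for illustration. The `SLUGS` lookup table and the `canonical_redirect` function are hypothetical names, not part of the original site; a real site would look the slug up in its database. The point is that the decision is made on the URI the client actually requested, so friendly URLs that are rewritten internally to index.php are not redirected in a loop.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical lookup from (section, page id) to the slug in the friendly URL.
# Only known pages are redirected, so arbitrary input never reaches Location.
SLUGS = {("something", "1055"): "some-text"}


def canonical_redirect(request_uri):
    """Return (301, location) for a legacy index.php URL, otherwise None."""
    parts = urlsplit(request_uri)
    if parts.path != "/index.php":
        return None  # already a friendly URL (or something unrelated)

    query = parse_qs(parts.query)
    section = query.get("pg", [None])[0]
    page_id = query.get("id", [None])[0]

    slug = SLUGS.get((section, page_id))
    if slug is None:
        return None  # unknown page: let the application handle it as usual

    return 301, f"/{section}/{slug}-{page_id}.html"
```

In PHP the same idea would amount to inspecting `$_SERVER['REQUEST_URI']` and sending a `Location` header with a 301 status before any output.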

Gumbo
+10  A: 

You'd be surprised at how pervasive and quick the Google bots are at indexing site content. Combine that with the many CMS systems that create unintended pages and links, and the most likely culprit is that those links were exposed at some point. It's also possible that your administration area isn't as secure as you think and the Google bot got through that way.

The well-behaved, Google-recommended things to do here are:

  1. If possible, create 301 redirects from your query-string-style URLs to your canonical-style URLs. That's you saying "hey there, web bot/browser, the content that used to be at this URL is now at this other URL".

  2. Block the query string content in your robots.txt. That's like asking the spiders or other automated programs "Hey, please don't look at this stuff. These aren't the URLs you're looking for"

  3. Google apparently allows you to specify a canonical URL now via a <link /> tag in the top of your page. Consider adding these in.
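For the robots.txt suggestion, here is a rough sketch of what the rule could look like and how to sanity-check it. It uses Python's standard `urllib.robotparser`, which is only a stand-in: its matching may differ in detail from what Googlebot does. The `ROBOTS_TXT` content and the `is_allowed` helper are assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content, intended to keep crawlers away from the
# query-string URLs while leaving the friendly URLs crawlable.
ROBOTS_TXT = """\
User-agent: *
Disallow: /index.php?
"""


def is_allowed(url_path, user_agent="Googlebot"):
    """Check a path against the assumed rules using Python's parser."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url_path)


if __name__ == "__main__":
    for path in ("/index.php?pg=something&id=1055",
                 "/something/some-text-1055.html"):
        print(path, "allowed" if is_allowed(path) else "blocked")
```

Note that a crawler that is blocked from a URL never requests it, so it also never sees a 301 redirect placed on that URL.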

As to whether doing the well-behaved things is the "right" thing to do re: Google rankings ... who knows. Only "Google" knows how their algorithms work now and how they will work in the future, and by Google, I mean a bunch of engineers and executives with conflicting goals on how search should work.

Alan Storm
Canonical URL via <link /> is the way to go. Or a sitemap.
spoon16
+1  A: 

Changing robots.txt will not help, since the page is already indexed.

The best option is a permanent redirect (301).

If you want to remove a page once Google has indexed it, the only way, more or less, is to make it return a 404 Not Found response.

stefpet
+7  A: 

Google now offers a way to specify a page's canonical URL. You can use the following code in your HTML to tell Google your canonical URL:

<link rel="canonical" href="http://www.example.com/product.php?item=swedish-fish" />

You can read more about canonical URLs in Google's blog post on the subject: http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html According to the blog post, Ask.com, Microsoft Live Search and Yahoo! all support the canonical tag.
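If the pages are generated by a script, the tag can be emitted from a small helper. This is only a sketch in Python; `canonical_link_tag` is a hypothetical name, and the friendly URL passed in is assumed to be the one you want indexed. Escaping matters because a URL containing `&` would otherwise produce invalid markup.

```python
from html import escape


def canonical_link_tag(canonical_url):
    """Build a <link rel="canonical"> tag for the page's <head>."""
    # escape() with quote=True also escapes double quotes, so the URL
    # cannot break out of the href attribute.
    href = escape(canonical_url, quote=True)
    return f'<link rel="canonical" href="{href}" />'


if __name__ == "__main__":
    print(canonical_link_tag(
        "http://www.example.com/something/some-text-1055.html"))
```

Both the friendly page and the index.php variant would output the same tag, pointing at the friendly URL.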

sjstrutt
I did not know that! Very cool.
David Grayson
+1  A: 

Is it possible you're posting a form to a similar URL and Google is simply picking it up from the page source?

MK_Dev