tags:

views: 60

answers: 2

I'm attempting to diagnose an issue wherein Google reduced my site:mysite.com page count from 86,400 to 17,500 over the course of three days.

Is there any way I can get a complete list of the pages that Google has in its index for my domain?

Details

I normally wouldn't post actual links for my site, but here's a good example: our Device Database (a list of devices supported by the tools we sell, plus related content) has over 2,200 devices published:

(index at http://www.keil.com/dd/)

However, the Google query site:www.keil.com/dd/chip returns only 1,620 results.

Each of the device pages also links to example code, PDFs, and other assets that are also supposed to be indexed, and were until very recently.

This isn't a huge difference. My discussion forum, however, has around 16,627 threads, each with its own pseudo-static page that is updated whenever anyone posts a new thread or a response. The site: query for http://www.keil.com/forum/docs returns only around 5,000 pages.

There are numerous other areas of this site (it's huge, and has been incrementally developed over the past 15 years) that are suffering from the same effect. On April 1st, the site: query was showing the 86k number, where it had been for the past several years (give or take, depending on forum activity). By April 4th, it was down to its current state.

Our SEO expert left a while back. Help.

+2  A: 

Use the Webmaster Tools to list links, report errors, choose which query string parameters are/aren't safe to ignore, and add a sitemap if necessary.

(Did you really have 86,400 real pages of actual unique content?)

bobince
Yes, there are over 90k pages of unique content. Obviously many of those are published from a database (discussion forum threads, documents related to devices we support, etc.). I'm only seeing around 180 crawl errors in Webmaster Tools, all related to links that aren't supposed to be crawled according to my robots.txt. That file hasn't changed in a year. I'll work on resolving those.
David Lively
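
For illustration, a minimal sketch of double-checking URLs against robots.txt with Python's standard urllib.robotparser; the sample URLs below are placeholders, not actual paths from the site:

```python
# Minimal sketch: verify which URLs robots.txt actually blocks for Googlebot.
# The URLs below are placeholders; substitute the paths reported as crawl errors.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.keil.com/robots.txt")
rp.read()

# Hypothetical sample of URLs flagged as crawl errors in Webmaster Tools.
suspect_urls = [
    "http://www.keil.com/forum/docs/example-thread.asp",
    "http://www.keil.com/dd/chip/example.htm",
]

for url in suspect_urls:
    allowed = rp.can_fetch("Googlebot", url)
    print(f"{url} -> {'allowed' if allowed else 'blocked by robots.txt'}")
```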
Yeah, check the parameter settings then; it may have decided that something important is ignorable noise, e.g. a `page=n` parameter for paging comments, if every comments page inside a thread shares a large block of duplicated content and largely the same title.
bobince
Hmmm... I wonder if creating a dynamic sitemap.xml that listed all of the forum threads would get around the paging problem. I'll give it a shot. I've accepted your answer, but obviously this will be an ongoing issue for quite a while. Thanks for the starting points.
David Lively
Well, if it's a problem with a page parameter being ignored (and that's only a guess), you can tackle it with the 'Parameter' settings in Webmaster Tools. Sitemaps do generally help too, though they may not be so practical for 90K pages' worth.
bobince
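
For reference, a minimal sketch of the dynamic sitemap.xml idea from the comments above, assuming a hypothetical get_forum_threads() helper that would pull (thread URL, last-post time) pairs from the forum database:

```python
# Minimal sketch of a dynamically generated sitemap.xml for the forum threads.
# get_forum_threads() is a hypothetical helper; replace it with a query against
# the forum database that yields (thread_url, last_post_datetime) pairs.
from datetime import datetime
from xml.sax.saxutils import escape


def get_forum_threads():
    # Placeholder data; in practice, read thread URLs and last-post times from the DB.
    return [
        ("http://www.keil.com/forum/docs/thread1.asp", datetime(2010, 4, 1)),
        ("http://www.keil.com/forum/docs/thread2.asp", datetime(2010, 4, 3)),
    ]


def build_sitemap(entries):
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for url, last_modified in entries:
        lines.append("  <url>")
        lines.append(f"    <loc>{escape(url)}</loc>")
        lines.append(f"    <lastmod>{last_modified.date().isoformat()}</lastmod>")
        lines.append("  </url>")
    lines.append("</urlset>")
    return "\n".join(lines)


if __name__ == "__main__":
    with open("sitemap.xml", "w", encoding="utf-8") as f:
        f.write(build_sitemap(get_forum_threads()))
```

Submitting the generated file in Webmaster Tools then reports how many of those thread URLs are actually indexed (see the next answer).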
+3  A: 

There are three elements involved in a change like that:

  1. You need to make sure that you have minimal duplication through URLs if you want to have a somewhat correct count to work with. See http://googlewebmastercentral.blogspot.com/2009/10/reunifying-duplicate-content-on-your.html for some ideas on that.

  2. The site:-query count is a very, very rough approximation that is not worth tracking.

  3. Instead of the site:-query, submit Sitemap files with all of your preferred URLs. For those URLs, you will see the indexed URL count in Webmaster Tools (it will only count if the listed URL is indexed exactly like you have included it, so working on fixing duplication through irrelevant URLs is vital).

If you want to know more about which URLs are actually indexed, I'd create separate Sitemap files for logical parts of your site and check the indexed URL count for each. It might be that the non-indexed URLs are not so important (e.g. detail pages might not matter so much if a higher-level page is indexed).
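
As a rough sketch of that suggestion (the section names and URL lists below are hypothetical, not the site's actual layout), one Sitemap file per logical part plus a sitemap index might look like this; note that the Sitemap protocol caps each file at 50,000 URLs:

```python
# Rough sketch: one Sitemap file per logical section plus a sitemap index,
# so the indexed-URL count in Webmaster Tools can be checked per section.
# The section URL lists are placeholders; in practice, enumerate each
# section's canonical URLs from the database or filesystem.
from xml.sax.saxutils import escape

SECTIONS = {
    "sitemap-devices.xml": ["http://www.keil.com/dd/chip/example.htm"],   # hypothetical
    "sitemap-forum.xml": ["http://www.keil.com/forum/docs/example.asp"],  # hypothetical
}

URLSET_HEADER = ('<?xml version="1.0" encoding="UTF-8"?>\n'
                 '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')


def write_urlset(filename, urls):
    # The protocol caps each Sitemap file at 50,000 URLs, so very large sections
    # would need further splitting (e.g. sitemap-forum-1.xml, sitemap-forum-2.xml).
    assert len(urls) <= 50000, "split this section into multiple files"
    with open(filename, "w", encoding="utf-8") as f:
        f.write(URLSET_HEADER)
        for url in urls:
            f.write(f"  <url><loc>{escape(url)}</loc></url>\n")
        f.write("</urlset>\n")


def write_index(filename, sitemap_urls):
    with open(filename, "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for url in sitemap_urls:
            f.write(f"  <sitemap><loc>{escape(url)}</loc></sitemap>\n")
        f.write("</sitemapindex>\n")


for name, urls in SECTIONS.items():
    write_urlset(name, urls)

write_index("sitemap-index.xml",
            [f"http://www.keil.com/{name}" for name in SECTIONS])
```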

John Mueller
+1. My understanding, though, is that the sitemap isn't really relied on much compared to ordinary crawling? Also, I understand that site:... is a rough approximation, but an order-of-magnitude difference is pretty significant.
David Lively
The Sitemap is used for discovering new URLs but also for determining the canonical URLs (when we have multiple URLs that lead to the same content), so it's a good way to determine how many of the preferred URLs are actually indexed. The site:-query count can differ by a large amount; it's not meant to be conclusive, as it's optimized for speed, not accuracy.
John Mueller