tags:
views: 308
answers: 5

We are having trouble getting Google to index the PDF files on our site. There are about 50 PDFs, ranging in size from 20 KB to a little under 2 MB. They are not protected, can be read anonymously, and in a PDF reader you can search the document text.

They are listed in the SiteMap.xml. I can even look at the IIS logs and see Googlebot reading the PDF files, but, except for five, they are never included in the search results.
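For reference, a PDF is listed in a sitemap the same way as any HTML page. A minimal sketch (the URL and date are made up for illustration):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- PDFs need no special markup; they are listed like any other URL -->
  <url>
    <loc>http://www.example.com/docs/report.pdf</loc>
    <lastmod>2009-01-15</lastmod>
  </url>
</urlset>
```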

If I do a filetype:pdf search, only five PDFs show up. If I search for text I know is inside a PDF, the PDFs never show up (except for the five that are indexed).

Does anyone have any idea why the other 45+ PDF documents are not being included in the index, even though they are in the sitemap and Googlebot is reading them?

+1  A: 

Are you specifying the content-type for Google?
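(On IIS 7+, the MIME mapping for PDFs can be confirmed in web.config. A sketch, not specific to the asker's setup; IIS normally maps .pdf to application/pdf by default, and you may need a matching `<remove>` first if the extension is already mapped at server level:)

```xml
<configuration>
  <system.webServer>
    <staticContent>
      <!-- Ensure PDFs are served with the correct content-type -->
      <mimeMap fileExtension=".pdf" mimeType="application/pdf" />
    </staticContent>
  </system.webServer>
</configuration>
```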

Chris Ballance
+1  A: 

There can be quite a lag between Google initially reading your content and it appearing in the index. We recently re-launched a site, submitting sitemaps to Google on launch, and it took approximately three weeks for the new pages to start showing up in search results.

How long ago did you submit these PDFs via your sitemap?

(except for the five that are indexed)

It sounds like your PDFs are being indexed, but it's taking some time. Presuming that there's no difference in the way the non-indexed PDFs have been generated, then I'd suspect it's just the index taking a while to update.

On a slight tangent, one useful tool I'd recommend signing up for is Google Webmaster Tools - it shows you the crawl rate, problems with your site, sitemaps, and indexing within a day or so of Googlebot hitting your site. It could save you a bit of time going through your IIS logs.

ConroyP
It's been about four weeks since we first submitted our sitemap. I just noticed that last night they indexed four more, so maybe I just need to keep waiting :)
Jim Biddison
When you re-launched the site, if it took three weeks for the new pages to start showing up in the search results, didn't that mean that for three weeks, searches returned results for pages that no longer existed on your site? Didn't this result in a lot of 'page not found' errors?
Jim Biddison
In our situation, the relaunch coincided with the launch of a new section, and the old links still functioned - the three weeks was the time for the new section to start showing up. The random wait time can be a bit frustrating, all right!
ConroyP
A: 

You can try to submit to Google directly, this may speed up the process:

http://www.google.com/submit_content.html

srand
+3  A: 

Are all the PDFs located in the same spot? I once had the problem that one of my PDF locations was inside a folder that was excluded by robots.txt. Submit your sitemap directly to the Google Webmaster Tools site and you may get valuable information as to why the PDFs are not appearing. In my case, Google told me, "Hey, these 54 PDF documents are in your sitemap, but due to robots.txt restrictions we cannot index them," which was pretty helpful. But mind what the commenter says: it can take a while until this information appears.

Google Webmaster Tools: https://www.google.com/webmasters/tools
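To rule out this kind of accidental exclusion yourself, the Python standard library's robot-rules parser can check a robots.txt against a PDF URL. A minimal sketch, using a made-up `Disallow` rule and hypothetical URLs:

```python
from urllib import robotparser

# Hypothetical robots.txt that blocks an entire folder of PDFs
rules = """\
User-agent: *
Disallow: /private-docs/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# A PDF inside the excluded folder is invisible to Googlebot...
print(rp.can_fetch("Googlebot", "http://www.example.com/private-docs/report.pdf"))
# ...while one outside it can be crawled
print(rp.can_fetch("Googlebot", "http://www.example.com/docs/report.pdf"))
```

Running this against your real robots.txt and the actual PDF URLs from your sitemap would confirm in seconds whether any of them fall under a Disallow rule.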

tharkun
I'll just add that Google Webmaster Tools does not give all info in real time. It's still a vital resource though.
Liam
No, the PDFs are located in several different places on the site. I have checked and none of them are being blocked by robots.txt. I have been using Webmaster Tools and submitting sitemaps, and will continue to do so. Thanks for your feedback. Jim
Jim Biddison
A: 

Are your PDF files OCR-scanned so the text is selectable and searchable? Or are the PDFs scanned with no OCR, in which case the text is stored as one large image? If a PDF is all images, I don't think Google can index it (yet). Or has Google found your pages by now?

Bratch