I inherited a Drupal 5 site recently and have a series of enhancements to make. Several of them revolve around search results.

  1. Unpublished pages showing up in search engine results. Some of these are old pages, others were recently unpublished. All are correctly marked as unpublished in the CMS but are still showing up.

  2. Outdated pages are showing up in search results. The URL path structure changed, and those items are old results still sitting in the search database.

From what I can tell, the site uses the Google Search Appliance (GSA) for search rather than the default Drupal search. Is there a way I can be certain that it's using the GSA other than seeing the module enabled?

If it is the GSA, it seems that I could get someone with access to the appliance to rebuild the search results for the site. Is this correct?

If rebuilding the search results is the right way to go about it, it seems that whenever a fair amount of content is removed from the site, I'll need to get someone to rebuild the search. Is there a better/automatic way?

+1  A: 

Sounds like it's Drupal that is handling the search. The Google appliance would need database access to show unpublished nodes. It could be that you are using Views to do the search but forgot to filter down to published nodes only.
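If a view or a custom query does turn out to be behind the search, the fix is to add a "Node: Published = Yes" filter to the view, or to restrict the query to published nodes. A minimal Drupal 5 style sketch of the query case (the query itself is illustrative, not taken from your site):

    <?php
    // Illustrative only: restrict a hand-rolled search query to published
    // nodes. The {node} table and its status column are Drupal core;
    // status = 1 means the node is published.
    $result = db_query(
      "SELECT n.nid, n.title FROM {node} n
       WHERE n.status = 1 AND LOWER(n.title) LIKE LOWER('%%%s%%')",
      $keywords
    );
    while ($row = db_fetch_object($result)) {
      // Only published nodes reach this point.
      print check_plain($row->title) . '<br />';
    }
    ?>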

If Drupal is handling the search, you just need to flush and rebuild the search index. This can be done without too much trouble if you don't have too much content.
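The supported way to do that is the "Re-index site" button on the search settings page (admin/settings/search in Drupal 5), followed by running cron until the index is rebuilt. If you need to wipe the index by hand, here is a rough sketch, assuming the default core search tables:

    <?php
    // Rough sketch, assuming the default Drupal core search tables
    // (search_dataset, search_index, search_total). Back up the database
    // first; cron must then run (possibly many times) before search
    // results are complete again.
    db_query('DELETE FROM {search_dataset}');
    db_query('DELETE FROM {search_index}');
    db_query('DELETE FROM {search_total}');
    ?>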

googletorp
It turns out that permissions are off and anonymous users can access that content, so it may be something else.
easement
A: 

I have posted an answer to your more general question concerning node access. The problem with your search results might well be related to that.

Henrik Opel
A: 

In order to keep the Google Appliance more up to date, you might try out XmlSiteMap, a module that publishes a proper XML sitemap for all your content.

For a public website, publishing a sitemap is a good way to keep the search engines up to date, as they can use it to learn about new pages and to purge old ones. I'm assuming that the Google Appliance would use this too.
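For reference, the file such a module generates follows the standard sitemap protocol and looks roughly like this (the example.com URLs are placeholders):

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- Minimal sitemap sketch; the URLs are placeholders. Only published
         pages are listed, so unpublished or deleted content simply stops
         appearing the next time the sitemap is regenerated. -->
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://example.com/about-us</loc>
        <lastmod>2010-05-01</lastmod>
        <changefreq>monthly</changefreq>
      </url>
    </urlset>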

Bevan
+1  A: 

The GSA could still be showing deleted content depending on what your data source is.

If the content is coming from a database feed, items dropped from the query will also be dropped from the index. If the content is coming from a natural crawl or through a custom connector feed, it will not be removed from the index on delete; instead it has to cycle out of the index naturally, which can take a while.

One way to block deleted URLs from being displayed is to do it through the front end. In the GSA admin interface, go to Serving > Front Ends, choose your front end, and click the Remove URLs tab. You can either list individual URLs or block a group of URLs through regular expressions.
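As a rough example, assuming the standard GSA URL pattern syntax (the paths below are placeholders for your own), the Remove URLs box could contain entries like:

    # Placeholders only; plain entries match URL prefixes, while the
    # regexp: prefix switches a line to regular-expression matching.
    www.example.com/old-section/
    regexp:example\.com/node/[0-9]+$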

DMurph11