We're using the Google CSE (Custom Search Engine) paid service to index content on our website. The site is built mostly of PHP pages assembled with include files, but there are some dynamic pages that pull info from a database into a single page template (new releases, for example). The issue is that I can set an expiration date on content in the database, so that, say, "id=2" brings up a "This content is expired" notice. However, if ID 2 had an uploaded PDF attached to it, the PDF file remains in the search index.
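To give an idea of the setup, the expiry check on the dynamic template is roughly this (a simplified sketch; the table and column names are stand-ins, not our real schema):

<?php
// Simplified expiry check (hypothetical schema: a `content` table
// with `id`, `body`, and `expires` columns).
$pdo = new PDO('mysql:host=localhost;dbname=site', 'user', 'pass');
$stmt = $pdo->prepare('SELECT body, expires FROM content WHERE id = ?');
$stmt->execute([(int) $_GET['id']]);
$row = $stmt->fetch(PDO::FETCH_ASSOC);

if (!$row || strtotime($row['expires']) < time()) {
    echo 'This content is expired';
} else {
    echo $row['body'];
}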

I know I could write a cleanup script, run by cron, that looks at the DB, finds expired content, checks whether any uploaded files are attached, and either renames or removes them, but there has to be a better solution (I hope).
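Roughly what I had in mind for the cron job, as a sketch only (same made-up schema as above; assumes every upload lands in /uploads/):

<?php
// cleanup.php - run nightly from cron (e.g. "0 2 * * * php cleanup.php").
// Sketch only; assumes a `content` table with `expires` (datetime)
// and `url` (uploaded filename) columns.
$pdo = new PDO('mysql:host=localhost;dbname=site', 'user', 'pass');
$rows = $pdo->query(
    "SELECT id, url FROM content WHERE expires < NOW() AND url <> ''"
);
foreach ($rows as $row) {
    $path = '/var/www/uploads/' . basename($row['url']);
    if (is_file($path)) {
        unlink($path); // or rename() into an archive directory instead
    }
}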

Please let me know if you have encountered this in the past, and what you suggest.

Thanks, D.

A: 

Unfortunately there's no way to give you a straight answer at this time: we don't know how your PDFs are "attached" to your pages or how your DB is structured.

The best solution would be a robots.txt file that blocks the URLs of the particular PDF files you want removed. Google will drop them from its index the next time it recrawls those URLs, though that can take anywhere from a few hours to a few days.

http://www.robotstxt.org/
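For example, assuming the expired PDFs all sit under a single /uploads/ directory (the filenames below are placeholders), the file would look something like:

User-agent: *
Disallow: /uploads/expired-report.pdf
Disallow: /uploads/old-release-notes.pdf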

mattbasta
I hadn't thought about writing to the robots file... that may work. The files are uploaded via an upload script, and the filename is stored in the DB. All the files go to the same directory, so something like http://www.domainname.com/uploads/pdffilehere.pdf would be the path, and "pdffilehere.pdf" would be stored in the "url" column of the DB.
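So the expire/cleanup step could just regenerate robots.txt from the DB each time. Something along these lines, as a sketch (the `content` table name is hypothetical):

<?php
// Rebuild robots.txt from the expired rows (sketch; made-up schema).
$pdo = new PDO('mysql:host=localhost;dbname=site', 'user', 'pass');
$rows = $pdo->query(
    "SELECT url FROM content WHERE expires < NOW() AND url <> ''"
);
$lines = ['User-agent: *'];
foreach ($rows as $row) {
    $lines[] = 'Disallow: /uploads/' . basename($row['url']);
}
file_put_contents('/var/www/robots.txt', implode("\n", $lines) . "\n");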
Don
A: 

What we ended up doing was tying a check script to the upload script: once the current upload completes, the old files are unlink()ed and their DB records are deleted.

For us this works because it's an "add one/remove one" situation: we want a set number of items to appear in rolling order.
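The relevant part of the upload script looks roughly like this (a simplified sketch; our real table, column, and path names differ):

<?php
// After a successful upload, keep only the $keep most recent items;
// unlink the files for anything older and delete their DB rows.
// Sketch only - table/column names here are placeholders.
$keep = 10;
$pdo  = new PDO('mysql:host=localhost;dbname=site', 'user', 'pass');
$rows = $pdo->query(
    "SELECT id, url FROM releases ORDER BY created DESC"
)->fetchAll(PDO::FETCH_ASSOC);

foreach (array_slice($rows, $keep) as $old) {
    $path = '/var/www/uploads/' . basename($old['url']);
    if (is_file($path)) {
        unlink($path); // remove the stale PDF
    }
    $del = $pdo->prepare('DELETE FROM releases WHERE id = ?');
    $del->execute([$old['id']]);
}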

Don