views: 37
answers: 1
I need a way to remove "unused" images from my filesystem, i.e. images that are never accessed from any point in my website (it doesn't matter if I break external links; I might disable external hotlinking altogether). What's the best way of going about this? Regular users can add multiple attachments to topics/posts, and content contributors can bulk upload large numbers of images for use in articles or image galleries.

The problem is that the images could be referenced in any of the following ways:

  1. From user content (text/html, possibly Markdown or BBCode) stored in the database
  2. Hardcoded into an HTML page
  3. Hardcoded into a PHP file
  4. Hardcoded into a CSS file
  5. As an "attachment" field in a database table, usually containing only the filename with no path, because the application assumes it lives in a certain folder.

And to top it off, the path of the image could be an absolute or relative HTTP or PHP path and may or may not be built with string concatenation in PHP.

So obviously find/replace or regexing the database or filesystem is out of the question. But luckily for you and me, this system isn't fully implemented yet and I don't need anything that deals with an existing hoard of images. I just need to set up some efficient structure that will allow this in the future.

Some ideas I've thought of:

  • Intercepting the HTTP request for the image with PHP, and keeping track of the HTTP_REFERER. The problem with this is that just because no one has clicked on a link at the time of checking this doesn't mean the link doesn't exist.
  • Use extreme database normalization - i.e. make a table for images and use foreign keys for anything that references it. However, this would result in a metric craptonne of many-to-many relationships (and their cross tables), in addition to being impractical for any regular user to work with.
  • Backup all the images and delete them, and check every single 404 request and run a script each time that attempts to find the image from the backup folder and puts it in the "real" folder. The problem is that this cache would have to be purged every so often and the server might be strained when rebuilding the cache.
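The third idea above can be sketched quite compactly. Here is a minimal Python sketch (Python rather than PHP, purely for brevity); the function name and the idea of passing the backup/live folders in as parameters are my own assumptions, not part of the original plan:

```python
import shutil
from pathlib import Path

def restore_on_404(filename, backup_dir, live_dir):
    """Hypothetical helper for a 404 handler: if the missing image
    exists in the backup folder, copy it into the live folder so the
    request can be served and later requests hit the file directly.
    Returns True if the image was restored."""
    src = Path(backup_dir) / filename
    dst = Path(live_dir) / filename
    if not src.is_file():
        return False          # genuinely missing: let the 404 stand
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)    # copy, not move, so the backup stays intact
    return True
```

As noted, the catch is that the live folder is effectively a cache: purging it forces a wave of 404s, and rebuilding it under load could strain the server.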

Ideas/suggestions? Is this just something you have to ignore and live with even if you're building a site with a ridiculous number of images? Even if it's not worth it, how would something like this work, just as a proof of concept? (I added the garbage-collection tag because this might be heading into that area conceptually.)

+1  A: 

I will admit that my experience with this was simpler than yours. I had no 'user generated content' so to speak, and my images were referenced only in templates or in the database, always with a full path. What I did was create a Perl script that:

  • Analyzed my HTML templates, database table, and CSS, and generated a list of files
    • In the HTML it looked for <img> tags
    • In the CSS it looked for any .png, .jp*g, or .gif regex strings
    • The tables were easy because I had an Image table for the image data
  • The file list was then sorted to remove duplicates
  • The script then iterated through the list and wrote a CSV for auditing, like: filename,(CSS filename|HTML filename|DBTABLE),(exists|notexists)
  • In another iteration it renamed every file not in the list by appending .del to the filename
  • After regression testing I called the script with a -docleanup flag, which told it to go through and delete all the .del-appended files.
  • If for whatever reason an image was tagged as .del and shouldn't have been, I just manually renamed it back to its original form.
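The scan-then-rename core of the steps above can be sketched as follows. This is a minimal Python translation (the original was Perl), under stated assumptions: the regexes are my approximations of "look for `<img>` tags" and "look for .png/.jp*g/.gif strings", and the function names are hypothetical:

```python
import re
from pathlib import Path

# Approximations of the answer's patterns: <img src="..."> in HTML,
# and any path-like .png/.jp*g/.gif string in CSS.
IMG_SRC_RE = re.compile(r'<img[^>]+src=["\']([^"\']+)["\']', re.IGNORECASE)
CSS_IMG_RE = re.compile(r'[\w./-]+\.(?:png|jpe?g|gif)', re.IGNORECASE)

def referenced_images(html_texts, css_texts):
    """Return the de-duplicated set of image basenames referenced in
    the given HTML and CSS source strings."""
    refs = set()
    for text in html_texts:
        for src in IMG_SRC_RE.findall(text):
            refs.add(Path(src).name)
    for text in css_texts:
        refs.update(Path(m).name for m in CSS_IMG_RE.findall(text))
    return refs

def mark_unreferenced(image_dir, refs, do_cleanup=False):
    """Rename images not in `refs` to <name>.del; with do_cleanup=True
    (the -docleanup pass) unlink them instead.
    Returns the list of affected filenames."""
    affected = []
    for img in sorted(Path(image_dir).iterdir()):
        if img.suffix.lower() not in {'.png', '.jpg', '.jpeg', '.gif'}:
            continue
        if img.name in refs:
            continue
        if do_cleanup:
            img.unlink()
        else:
            img.rename(img.with_name(img.name + '.del'))
        affected.append(img.name)
    return affected
```

A sketch like this only handles cases 2 and 4 from the question (plus a dedicated image table); comparing on basenames sidesteps the absolute-vs-relative path problem but would conflate identically named files in different folders.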

A couple of notes: I realize that I could have made this script 'smoother' by combining several of these steps, but its use grew over time and I wanted clearly delineated processing stages so it could never run amok. I used the CSV to go back and clean up the references where the image didn't exist.

manyxcxi