I have a web server which saves cache files and keeps them for 7 days. The file names are MD5 hashes, i.e. exactly 32 hex characters long, and are kept in a tree structure that looks like this:

00/
  00/
    00000ae9355e59a3d8a314a5470753d8
    .
    .
  01/
    .
    .

You get the idea.

My problem is that deleting old files is taking a really long time. I have a daily cron job that runs

find cache/ -mtime +7 -type f -delete

which takes more than half a day to complete. I worry about scalability and the effect this has on the performance of the server. Additionally, the cache directory is now a black hole in my system, trapping the occasional innocent du or find.

The standard solution to an LRU cache is some sort of heap. Is there a way to scale this to the filesystem level? Is there some other way to implement this that makes it easier to manage?

Here are ideas I considered:

  1. Create 7 top directories, one for each day of the week, and empty one directory every day. This increases the lookup cost for a cache file up to 7-fold (a file could be in any of the 7 directories), makes it really complicated when a file is overwritten, and I'm not sure what it will do to the deletion time.
  2. Save the files as BLOBs in a MySQL table with indexes on name and date. This seemed promising, but in practice it's always been much slower than the filesystem. Maybe I'm not doing it right.

Any ideas?

+1  A: 

How about having a table in your database that uses the hash as the key? The other field would then be the name of the file. That way the file can be stored in a date-related fashion for fast deletion, and the database can be used to find the file's location from its hash quickly.
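
A minimal sketch of that idea, assuming a hypothetical sqlite3 database and a made-up by-date layout ($hash and $tmpfile stand in for your application's variables; the answer doesn't specify a schema):

# Hypothetical schema: hash -> dated path
sqlite3 cache.db 'CREATE TABLE IF NOT EXISTS cache (hash TEXT PRIMARY KEY, path TEXT);'

# On write: store the file under a per-day directory and record the mapping
path="by-date/$(date +%F)/$hash"
mkdir -p "$(dirname "$path")" && cp "$tmpfile" "$path"
sqlite3 cache.db "INSERT OR REPLACE INTO cache VALUES ('$hash', '$path');"

# On read: look up the file's location from its hash
path=$(sqlite3 cache.db "SELECT path FROM cache WHERE hash = '$hash';")

Deleting a whole day is then just an rm -rf of one dated directory plus a DELETE of the matching rows.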

David Arno
+1  A: 

ReiserFS is relatively efficient at handling small files. Did you try different Linux file systems? I'm not sure about its delete performance, but you could consider reformatting (mkfs) as a substitute for deleting files individually. For example, you could create a separate file system (cache1, cache2, ...) for each weekday.
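
A rough sketch of the per-weekday idea (the device and mount-point names below are made up; adapt them to your setup):

# One filesystem per weekday: wiping a day = reformat instead of find -delete
day=$(date +%u)                       # 1..7, Monday..Sunday
umount /srv/cache$day
mkfs.reiserfs -f /dev/vg0/cache$day   # -f skips the confirmation prompt
mount /dev/vg0/cache$day /srv/cache$day

Reformatting takes roughly the same time no matter how many files the day's cache held, which is the whole point of the trick.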

gimel
+13  A: 

When you store a file, make a symbolic link to a second directory structure that is organized by date, not by name.

Retrieve your files using the "name" structure, delete them using the "date" structure.
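
A minimal sketch, assuming GNU userland and a made-up by-name/by-date layout rooted in the cache directory:

# On write: store by name, then link the file into a per-day directory
day=$(date +%F)
mkdir -p "by-date/$day"
ln -s "../../by-name/00/00/$hash" "by-date/$day/$hash"

# Nightly purge: remove one expired day's files via the date structure
old=$(date -d '8 days ago' +%F)
for link in by-date/$old/*; do
    [ -L "$link" ] || continue
    rm -f "$(readlink -f "$link")"    # the real file...
    rm -f "$link"                     # ...and the link itself
done
rmdir "by-date/$old"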

Tomalak
Bugger :) You beat me to it. +1 this answer.
OJ
Just be sure to remove both the original file and the link. You don't want lots of dead links there, and it's also easy to just remove the link and not remove the original file.
Ben Combee
+1  A: 

How about this:

  • Have another folder called, say, "ToDelete"
  • When you add a new item, get today's date and look for a subfolder of "ToDelete" named after the current date
  • If it's not there, create it
  • In that day's folder, add a symbolic link to the item you've just created
  • Create a cron job that goes into the "ToDelete" subfolder of the expired date and deletes all the files the links point to, along with the links themselves
  • Finally, delete the now-empty dated folder (a crontab sketch follows below)
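
The cron piece could be a single crontab entry handing the expired date to a purge script like the loop sketched in the symlink answer above (the script path and schedule here are made up):

# Nightly at 03:30: purge the links, their targets, and the dated folder
30 3 * * * /usr/local/bin/purge-cache.sh "/cache/ToDelete/$(date -d '8 days ago' +\%F)"

Note the escaped % — cron treats an unescaped % in the command field as a newline.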
OJ
+4  A: 

Assuming this is ext2/3, have you tried enabling indexed directories? With a large number of files in any one directory, the name lookup that precedes each delete becomes painfully slow.
Use tune2fs -O dir_index to enable the dir_index feature (directories that already exist then need to be reindexed with e2fsck -D while the filesystem is unmounted).
When mounting the file system, use the noatime option, which stops the OS from writing an access-time update on every read of a file or directory.
Looking at the original post, you only have 2 levels of indirection to the files, which means the leaf directories can hold a huge number of files. Once there are more than a million entries in one of them, searches and changes become terribly slow. An alternative is a deeper hierarchy of directories (e.g. cache/00/00/0a/... instead of cache/00/00/...), reducing the number of items in any one directory and therefore the cost of searching and updating it.
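
Putting that together (the device name and mount point are examples):

# e2fsck must not run on a mounted filesystem, so unmount first
umount /var/www/cache
# Enable hashed directory indexes on an existing ext2/3 filesystem
tune2fs -O dir_index /dev/sda1
# Reindex directories that already exist
e2fsck -fD /dev/sda1
# Mount with noatime so reads stop triggering access-time writes
mount -o noatime /dev/sda1 /var/www/cache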

Petesh