tags:

views: 224

answers: 4

I'm using PHP to make a simple caching system, but I'm going to be caching up to 10,000 files in one run of the script. At the moment I'm using a simple loop with

$file = "../cache/".$id.".htm";
$handle = fopen($file, 'w');
fwrite($handle, $temp);
fclose($handle);

($id being a random string which is assigned to a row in a database)

but it seems a little bit slow. Is there a better method of doing this? Also, I read somewhere that on some operating systems you can't store thousands and thousands of files in a single directory; is this relevant to CentOS or Debian? Bear in mind this folder may well end up holding over a million small files.

Simple questions, I suppose, but I don't want to start scaling this code and then find out I'm doing it wrong; I'm only testing by caching 10-30 pages at a time at the moment.

A: 

File I/O in general is relatively slow. If you are looping over thousands of files and writing them to disk, the slowness could be normal.

I would move that over to a nightly job if that's a viable option.

Zack
Well, I can also have it only cache on page request; I suppose in this case that would be a better option?
zuk1
+3  A: 

Remember that in UNIX, everything is a file.

When you put that many files into a directory, something has to keep track of those files. If you do an:

ls -la

You'll probably notice that the '.' entry has grown to some size. This is where all the info on your 10,000 files is stored.

Every seek and every write into that directory will involve parsing that large directory entry.

You should implement some kind of directory hashing system. This'll involve creating subdirectories under your target dir.

e.g.

/somedir/a/b/c/yourfile.txt
/somedir/d/e/f/yourfile.txt

This'll keep the size of each directory entry quite small, and speed up IO operations.
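
A minimal PHP sketch of this idea, assuming the cache ID can be reused for the path segments (the helper name, directory depth, and permissions are illustrative, not from the answer):

// Illustrative only: derive nested subdirectories from the first
// characters of the cache ID, e.g. "abc123" -> ../cache/a/b/c/abc123.htm
function cache_path($id, $depth = 3) {
    $dir = "../cache/" . implode("/", str_split(substr($id, 0, $depth)));
    if (!is_dir($dir)) {
        mkdir($dir, 0777, true); // create the nested dirs on first use
    }
    return $dir . "/" . $id . ".htm";
}

file_put_contents(cache_path($id), $temp);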

Paul Alan Taylor
OK, this is actually damned easy to do given the way my system will be. Thanks, this is the sort of info I was looking for.
zuk1
you should accept the answer, then.
ithcy
I was going to; however, I wanted to wait a while in case anyone else weighed in with a contrary/improved answer.
zuk1
"You should implement some kind of directory hashing system." Most filesystems do this for you.
jrockway
Is this the same for folder listings? I.e. if my cache folder has 100,000,000 subfolders in it, would requesting a file from one of those subfolders be slow because of the number of folders in its parent folder?
zuk1
jrockway may be able to speak to this with more authority, but I don't think NTFS works the same way as some of the UN*X fs's - employing a master file table instead.
Paul Alan Taylor
A: 

The number of files you can effectively use in one directory is not operating-system but filesystem dependent.

You can split your cache dir effectively by taking the md5 hash of the filename and using its first 1, 2 or 3 characters as a subdirectory. Of course, you have to create the directory if it doesn't exist, and use the same approach when retrieving files from the cache.

For a few tens of thousands, 2 characters (256 subdirs from 00 to ff) would be enough.
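
A rough sketch of that approach in PHP (the cache root and the two-character split are just the example values from this answer; error handling is omitted):

// Illustrative: split the cache by the first two characters of the md5 hash.
function cache_file($id) {
    $subdir = "../cache/" . substr(md5($id), 0, 2);
    if (!is_dir($subdir)) {
        mkdir($subdir); // create e.g. ../cache/3f on first use
    }
    return $subdir . "/" . $id . ".htm";
}

// Writing and reading both go through the same path function.
file_put_contents(cache_file($id), $temp);
$cached = file_get_contents(cache_file($id));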

Csaba Kétszeri
A: 

You may want to look at memcached as an alternative to the filesystem. Keeping the cache in memory will give a huge performance boost.

http://php.net/memcache/
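
A minimal sketch using the pecl Memcache extension linked above (the server host, port, and one-hour expiry are assumptions):

$memcache = new Memcache;
$memcache->connect('localhost', 11211); // assumed local memcached server

// Store the rendered page in memory instead of writing it to disk.
$memcache->set($id, $temp, 0, 3600);

// Later, on a page request:
$page = $memcache->get($id);
if ($page === false) {
    // not cached (or expired) - regenerate and re-cache here
}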

Al