views: 492
answers: 5
I have a directory with 500,000 files in it. I would like to access them as quickly as possible. The algorithm requires me to repeatedly open and close them (I can't have 500,000 files open simultaneously).

How can I do that efficiently? I had originally thought that I could cache the inodes and open the files that way, but *nix doesn't provide a way to open files by inode (security or some such).

The other option is to just not worry about it and hope the filesystem does a good job with file lookup in a directory. If that is the best option, which filesystems would work best? Do certain filename patterns look up faster than others, e.g. 01234.txt vs foo.txt?

BTW this is all on Linux.

+5  A: 

A couple of ideas:

a) If you can control the directory layout, then put the files into subdirectories.

b) If you can't move the files around, then you might try different filesystems; I think XFS might be good for directories with lots of entries.

Douglas Leeder
Subdirectories might help. The filesystem opens and caches the subdirectories. One directory of 500K files is really big; 1,000 directories of 500 files each might allow smaller, faster directory caches.
S.Lott
+1 for the subdirectories. With ext[23], as the number of files in a single directory increases, it can get pretty slow.
Chris Kloberdanz
+2  A: 

If you've got enough memory, you can use ulimit to increase the maximum number of files that your process can have open at one time; I have successfully done this with 100,000 files, and 500,000 should work as well.
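
If you would rather raise the limit from inside the program than with the ulimit shell builtin, a minimal sketch using getrlimit/setrlimit might look like this (the 524288 figure is just an illustrative value, and raising the hard limit usually requires root):

    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rl;

        /* Read the current soft/hard limits on open file descriptors. */
        if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
            perror("getrlimit");
            return 1;
        }

        /* Ask for enough descriptors for ~500,000 open files (illustrative
         * value); raising the hard limit usually requires root privileges. */
        rl.rlim_cur = 524288;
        if (rl.rlim_max < rl.rlim_cur)
            rl.rlim_max = rl.rlim_cur;

        if (setrlimit(RLIMIT_NOFILE, &rl) != 0) {
            perror("setrlimit");
            return 1;
        }

        printf("RLIMIT_NOFILE soft limit is now %lu\n",
               (unsigned long)rl.rlim_cur);
        return 0;
    }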

If that isn't an option for you, try to make sure that your dentry cache has enough room to store all the entries. The dentry cache is the filename -> inode mapping that the kernel uses to speed up file access based on filename; accessing huge numbers of different files can effectively eliminate the benefit of the dentry cache, as well as introduce an additional performance hit. A stock 2.6 kernel has a hash table with room for up to 256 entries per MB of RAM at a time, so if you have 2 GB of memory you should be okay for up to a little over 500,000 files.
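
If you want a rough idea of how many dentries the kernel is currently caching, you can read /proc/sys/fs/dentry-state; a minimal sketch, assuming a Linux /proc filesystem (the first two fields are the total and unused dentry counts):

    #include <stdio.h>

    int main(void)
    {
        /* The first two fields of /proc/sys/fs/dentry-state are the number
         * of allocated dentries and the number of unused dentries. */
        FILE *f = fopen("/proc/sys/fs/dentry-state", "r");
        long nr_dentry, nr_unused;

        if (f == NULL) {
            perror("fopen");
            return 1;
        }
        if (fscanf(f, "%ld %ld", &nr_dentry, &nr_unused) == 2)
            printf("dentries: %ld total, %ld unused\n", nr_dentry, nr_unused);
        fclose(f);
        return 0;
    }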

Of course, make sure you perform the appropriate profiling to determine if this really causes a bottleneck.

Robert Gamble
A: 

Another question is how much data is in the files? Is an SQL back end an option?

The Digital Ninja
I tried SQL. The contents of the files are a sorted list of IDs and values. The file names are IDs as well. I ran this with an SQLite database and it took 26 hours to create an index on just the filename IDs. The SQL people suggested not to use a DB as I only need an index.
caspin
+4  A: 

Assuming your file system is ext3, your directory is indexed with a hashed B-tree if dir_index is enabled. That's going to give you as much of a boost as anything you could code into your app.

If the directory is indexed, your file naming scheme shouldn't matter.

http://lonesysadmin.net/2007/08/17/use-dir_index-for-your-new-ext3-filesystems/

Corbin March
+1  A: 

The traditional way to do this is with hashed subdirectories. Assume your file names are all uniformly-distributed hashes, encoded in hexadecimal. You can then create 256 directories based on the first two characters of the file name (so, for instance, the file 012345678 would be named 01/2345678). You can use two or even more levels if one is not enough.

As long as the file names are uniformly distributed, this will keep the directory sizes manageable, and thus make any operations on them faster.
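
A minimal sketch of that naming scheme in C, assuming hexadecimal file names of at least three characters and a single level of 256 subdirectories (the hashed_path helper is just for illustration):

    #include <stdio.h>

    /* Map a flat hexadecimal file name such as "012345678" to a
     * one-level hashed path such as "01/2345678". */
    static void hashed_path(const char *name, char *out, size_t outlen)
    {
        /* Use the first two characters as the subdirectory name. */
        snprintf(out, outlen, "%.2s/%s", name, name + 2);
    }

    int main(void)
    {
        char path[512];

        hashed_path("012345678", path, sizeof(path));
        printf("%s\n", path);   /* prints 01/2345678 */
        return 0;
    }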

CesarB