I have an application that will download and cache, at a minimum, 250,000 8KB* files totaling about 2GB. I need to remove the least recently used file when updating this cache. *These tiny files span two 4KB sectors.

What is the relative cost of obtaining a file handle by name for this type of file in a directory on an NTFS-formatted 5400 RPM drive? If I store all 250K files in one directory, will merely getting a file handle take more than a few milliseconds? I can easily bucket the files into different directories.

Windows 7 disables the last access time for files by default, and I don't want to require an administrator to enable this feature. Should I maintain a separate list of file access times in memory (serialized to disk when the app exits)?
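
If I do track access times myself, this is roughly what I have in mind (just a sketch; the class name, file format, and linear LRU scan are placeholders I made up, not a finished design):

    using System;
    using System.Collections.Generic;
    using System.IO;

    // Sketch: track last-access times in memory instead of relying on NTFS,
    // and persist the table to a small index file when the app exits.
    class AccessTimeIndex
    {
        private readonly Dictionary<string, DateTime> _lastAccess = new Dictionary<string, DateTime>();
        private readonly string _indexPath;

        public AccessTimeIndex(string indexPath) { _indexPath = indexPath; Load(); }

        // Record that a cached file was just used.
        public void Touch(string fileName) { _lastAccess[fileName] = DateTime.UtcNow; }

        // Name of the least recently used file, or null if the index is empty.
        public string LeastRecentlyUsed()
        {
            string lru = null;
            DateTime oldest = DateTime.MaxValue;
            foreach (var pair in _lastAccess)
                if (pair.Value < oldest) { oldest = pair.Value; lru = pair.Key; }
            return lru;
        }

        // Call on clean shutdown; format is a simple count-prefixed list of (name, ticks).
        public void Save()
        {
            using (var w = new BinaryWriter(File.Create(_indexPath)))
            {
                w.Write(_lastAccess.Count);
                foreach (var pair in _lastAccess) { w.Write(pair.Key); w.Write(pair.Value.Ticks); }
            }
        }

        private void Load()
        {
            if (!File.Exists(_indexPath)) return;
            using (var r = new BinaryReader(File.OpenRead(_indexPath)))
            {
                int count = r.ReadInt32();
                for (int i = 0; i < count; i++)
                    _lastAccess[r.ReadString()] = new DateTime(r.ReadInt64(), DateTimeKind.Utc);
            }
        }
    }

Finding the LRU entry here is a linear scan, which for 250K entries is still trivial next to a disk seek; a sorted structure would tighten it up if eviction became frequent.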

Should I consider storing these files in one large flat file? Memory mapping might be difficult if I have to target anything older than .NET 4.0 (which introduced System.IO.MemoryMappedFiles).
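
If I did go the single-flat-file route on .NET 4.0, I'm picturing something along these lines (a minimal sketch assuming one fixed 8KB slot per cached file; slot assignment and eviction bookkeeping are left out):

    using System;
    using System.IO;
    using System.IO.MemoryMappedFiles;

    // Sketch of a flat cache file with fixed 8KB slots, memory-mapped via .NET 4.0.
    class FlatFileCache : IDisposable
    {
        private const int SlotSize = 8 * 1024;   // one 8KB slot per cached file (assumption)
        private readonly MemoryMappedFile _map;

        public FlatFileCache(string path, long slotCount)
        {
            _map = MemoryMappedFile.CreateFromFile(
                path, FileMode.OpenOrCreate, null, slotCount * SlotSize);
        }

        public byte[] ReadSlot(long slot)
        {
            var buffer = new byte[SlotSize];
            using (var view = _map.CreateViewAccessor(slot * SlotSize, SlotSize))
                view.ReadArray(0, buffer, 0, SlotSize);
            return buffer;
        }

        public void WriteSlot(long slot, byte[] data)
        {
            using (var view = _map.CreateViewAccessor(slot * SlotSize, SlotSize))
                view.WriteArray(0, data, 0, data.Length);
        }

        public void Dispose() { _map.Dispose(); }
    }

Reads and writes through the view never touch the directory index, so the per-file open cost disappears; the trade-off is that I would own slot allocation and reuse myself.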

+1  A: 

One seek is approximately 15ms on an average 5400rpm drive. The rest is minuscule in comparison.

GregC
I would also recommend placing files in buckets of 2000-3000 files each.
GregC
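
For illustration, bucketing on a stable hash of the name might look like the sketch below (the 128-directory count is my own assumption, chosen to land near 2,000 files per directory for 250K files; the comment above doesn't prescribe a scheme):

    using System.IO;

    static class CacheBuckets
    {
        private const int BucketCount = 128;   // ~250K files / 128 ≈ 2,000 per directory (assumption)

        public static string BucketedPath(string cacheRoot, string fileName)
        {
            // Simple deterministic hash; string.GetHashCode() is avoided because it
            // isn't guaranteed to be stable across runs, and the layout must persist.
            int hash = 0;
            foreach (char c in fileName)
                hash = unchecked(hash * 31 + c);

            int bucket = (hash & 0x7fffffff) % BucketCount;
            string dir = Path.Combine(cacheRoot, bucket.ToString("D3"));
            Directory.CreateDirectory(dir);    // no-op if the directory already exists
            return Path.Combine(dir, fileName);
        }
    }

Any stable hash works; the point is only that the mapping from name to directory must not change between runs.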
I also think that using any decent database could be helpful instead of flat directories/files.
GregC
Use Firefox as motivation. They deal with lots of small files on the client side... SQLite is used for storing history and the file cache. Internet Explorer splits its cache into several bins. 'Nuff said.
GregC
A: 

Opening 250,000 files -- if that's what you mean -- will take more than a few milliseconds, yes. The size of the directory is less interesting than the fact that you're going through the entire file system stack 250,000 times (everything from NTFS to the kernel to your grandmother's favorite anti-virus filter gets a chance to play on each open).

And last access time isn't rock-solid in any case.

jrtipton
I only need to open one of those files within a few milliseconds, not all of them. I'm trying to figure out how expensive it is to open a file in a highly populated directory. Is it worth creating subdirectories if I have 250K files in one directory?
Cat
The only added overhead is in the index operations, which is to say doing the lookup in the directory index. The index is a B-tree, so the lookup is O(log n), which works out to about 18 comparisons for 250K entries (log2 of 250,000 is roughly 17.9). So, subjectively, I'd say you're probably alright.
jrtipton
Actually, I should point out that enumerating the directory and inserting new files into it can be tricky perf-wise if short name generation is not disabled and the files have long names. I was just referring to the open case.
jrtipton