I have an app that stores file-based data under an NTFS directory path keyed off the SHA-1 hash of the data. It has several really nice attributes (de-duplication, imperviousness to other metadata changes, etc.), but I'm curious about best practices people have experienced for creating hash-based directory storage structures. My primary concern is the number of files/folders that can realistically be stored at a given folder depth.
Does anyone know what sorts of limitations I'll run into? If I were to dump them all into folders at the root of the storage path, I feel I would severely limit the storage's ability to grow. While it won't be a problem soon, I'd rather have a structure that avoids the issue than try to restructure a massive store later.
If I took the approach of chunking up the signature to create a deeper tree, is there any guidance on how much I would need to chunk it? Would something like this suffice?
StringBuilder foo = new StringBuilder(60);
// ...root, etc.
// A SHA-1 hex digest is always 40 characters; chunk it up to distribute files into smaller groups,
// e.g. "\0000\0000000000000000\00000000000000000000"
foo.Append(Path.DirectorySeparatorChar);
foo.Append(sha1, 0, 4);
foo.Append(Path.DirectorySeparatorChar);
foo.Append(sha1, 4, 16);
foo.Append(Path.DirectorySeparatorChar);
foo.Append(sha1, 20, 20);
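For context, here is a fuller sketch of what I mean (the class and method names are just illustrative, not my real code): hash the blob, hex-encode the digest, and split the 40 hex characters 4/16/20 into a relative path. I'm hex-encoding via `BitConverter.ToString` since that works on older frameworks.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

static class HashedStore
{
    // Illustrative helper: map a data blob to a relative path like
    // "da39\a3ee5e6b4b0d3255\bfef95601890afd80709".
    public static string PathFor(byte[] data)
    {
        string hex;
        using (SHA1 sha = SHA1.Create())
        {
            hex = BitConverter.ToString(sha.ComputeHash(data))
                              .Replace("-", "")
                              .ToLowerInvariant(); // 40 hex chars
        }

        StringBuilder foo = new StringBuilder(48);
        foo.Append(hex, 0, 4);   // first level: 65,536 possible directories
        foo.Append(Path.DirectorySeparatorChar);
        foo.Append(hex, 4, 16);
        foo.Append(Path.DirectorySeparatorChar);
        foo.Append(hex, 20, 20); // remaining 20 chars as the leaf name
        return foo.ToString();
    }
}
```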
Knowing that SHA-1 has a pretty uniform distribution, I would have to assume that eventually there would be some larger clusters, but that on average the files would be evenly distributed. (For scale: with 4 hex characters the first level has 65,536 possible directories, so even 100 million files would average roughly 1,500 first-level entries each.) It is those clusters that I'm concerned about.
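One way to sanity-check the clustering worry is a quick simulation (my own sketch; it assumes hashing sequential integers is a reasonable stand-in for real data): bucket n hashes by their first 4 hex characters and see how full the worst bucket gets relative to the average.

```csharp
using System;
using System.Linq;
using System.Security.Cryptography;

static class ClusterSim
{
    // Bucket n synthetic SHA-1 hashes by their first two bytes (= the first
    // 4 hex chars, so 65,536 buckets) and return the fullest bucket's size.
    public static int MaxBucket(int n)
    {
        int[] counts = new int[65536];
        using (SHA1 sha = SHA1.Create())
        {
            for (int i = 0; i < n; i++)
            {
                byte[] hash = sha.ComputeHash(BitConverter.GetBytes(i));
                counts[(hash[0] << 8) | hash[1]]++;
            }
        }
        return counts.Max();
    }
}
```

If the hash behaves uniformly, my understanding is the fullest bucket should only ever be a small multiple of the average, not a pathological hot spot.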
Are there performance penalties when accessing directory structures that are too wide? I know Windows Explorer will choke, but what about programmatic access via C# / System.IO?
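For concreteness, by "programmatically accessing" I mean something along these lines (a sketch; `Directory.EnumerateFiles` requires .NET 4, and the class name is just illustrative), which streams entries lazily rather than buffering the whole listing into one array the way `Directory.GetFiles` does:

```csharp
using System.IO;
using System.Linq;

static class StoreScan
{
    // Walk the whole store and count files. EnumerateFiles yields entries
    // as it goes, so a very wide directory isn't loaded into memory at once.
    public static int CountFiles(string storageRoot)
    {
        return Directory.EnumerateFiles(storageRoot, "*", SearchOption.AllDirectories)
                        .Count();
    }
}
```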