tags:
views: 412
answers: 3

I have an app that stores file-based data under an NTFS directory path keyed off the SHA-1 hash of the data. It has several really nice attributes (de-duplication, imperviousness to other metadata changes, etc.), but I'm curious about the best practices people have found for creating hash-based directory storage structures. My primary concern is the number of files/folders that can realistically be stored at a given folder depth.

Does anyone know what sorts of limitations I'll run into? If I were to dump them all into folders at the root of the storage path, I feel I would severely limit the ability of the storage to grow. While it won't be a problem soon, I'd rather have a structure that avoids this than have to restructure massive storage later.

If I took an approach that chunks up the signature to create a deeper tree, is there any guidance on how much I would need to chunk it? Would something like this suffice?

StringBuilder foo = new StringBuilder(60);
// ...root, etc.
// a SHA-1 hex string is always 40 chars; chunk it up to distribute into smaller groups
// e.g. "\0000\0000000000000000\00000000000000000000"
foo.Append(Path.DirectorySeparatorChar);
foo.Append(sha1, 0, 4);
foo.Append(Path.DirectorySeparatorChar);
foo.Append(sha1, 4, 16);
foo.Append(Path.DirectorySeparatorChar);
foo.Append(sha1, 20, 20);

Knowing that SHA-1 output is pretty evenly distributed, I have to assume the files would spread evenly on average, but that sooner or later some clusters would form. It's those clusters I'm concerned about.
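
As a rough back-of-the-envelope (treating the hashes as uniformly random), a 4-character first level gives:

1,000,000 files / 16^4 = 65,536 buckets  ≈  15 files per bucket on average

so in theory even an unlucky bucket stays in the tens of files rather than the thousands.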

Are there performance penalties when accessing directory structures that are too wide? I know that Windows Explorer will choke, but what about programmatic access via C# / System.IO?
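
For concreteness, the access pattern is nothing fancier than building the path from the hash and going straight through System.IO, roughly like this (BuildPath standing in for the chunking above):

// BuildPath is a placeholder for the chunking logic shown earlier
string path = BuildPath(storageRoot, sha1);

if (!File.Exists(path))
{
    Directory.CreateDirectory(Path.GetDirectoryName(path));
    File.WriteAllBytes(path, data);
}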

+3  A: 

Some observations:

  • You split after 4 chars and then after 16 more. 4 chars on their own can lead to 65536 entries in a directory, and 16 chars can lead to 16^16 entries, which is certainly way too much (and there are still 20 more characters remaining ...)
  • So the next question is: how did you choose these numbers? They look like magic numbers to me. You seem to be hoping that your splits will do the work in all cases ...

Your question about the directory depth that can be handled is a good one - and I can't answer it. But you should check whether 20 nested directories are too much to handle, because 20 levels would let you keep a maximum of 256 entries per level:

xx/xx/xx/xx/xx/...

On the other hand, you could stick with your 4 characters, which would lead to a depth of 10 and a maximum of 65536 entries per level:

xxxx/xxxx/xxxx/xxxx/xxxx/...

However - in both cases I'd probably write a dynamic algorithm, which checks the number of items per level and introduces new subfolders as you need them. So the first 256 (or 65536) items would just go into one directory.
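
A minimal sketch of that kind of on-demand split, assuming two hex chars per level, files named with their full 40-character hash, and an arbitrary 4096-file threshold:

const int MaxFilesPerDir = 4096; // assumed threshold, tune to taste

// Returns the full path at which a file with the given 40-char hash should be stored,
// splitting the target directory into two-char buckets once it gets too full.
// A pure lookup only needs the descend loop plus a File.Exists check.
static string GetStorePath(string root, string sha1Hex)
{
    string dir = root;
    int offset = 0;

    // descend while a bucket for the next two hash chars already exists
    while (offset < sha1Hex.Length &&
           Directory.Exists(Path.Combine(dir, sha1Hex.Substring(offset, 2))))
    {
        dir = Path.Combine(dir, sha1Hex.Substring(offset, 2));
        offset += 2;
    }

    // split this directory once it holds too many files
    if (Directory.GetFiles(dir).Length >= MaxFilesPerDir)
    {
        foreach (string file in Directory.GetFiles(dir))
        {
            // every file here shares dir's relative path as its hash prefix,
            // so the next two chars of its name pick the new bucket
            string name = Path.GetFileName(file);
            string bucket = Path.Combine(dir, name.Substring(offset, 2));
            Directory.CreateDirectory(bucket);
            File.Move(file, Path.Combine(bucket, name));
        }
        dir = Path.Combine(dir, sha1Hex.Substring(offset, 2));
        Directory.CreateDirectory(dir);
    }

    return Path.Combine(dir, sha1Hex);
}

A file always sits at the deepest existing bucket along its own hash prefix, so lookups stay deterministic; the cost is that a split has to move the files in the one directory being split.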

tanascius
Yeah, they were fairly magic numbers. It was more to show that I could pick *some* subset of the 40 chars per level. I figured there would be more entries in total toward the root, so the number of digits per level could grow with depth. That, plus I read somewhere about tens of thousands being a limitation (I don't recall the specifics), hence the magic 4+16+20. I figured I wouldn't be able to saturate any one directory, but didn't want chance clustering to get too out of hand. Total guesswork, hence the question about a better heuristic.
McKAMEY
OK, my point is: don't guess - use a dynamic algorithm which will introduce new levels of hierarchy when necessary. But I can't give you a number for when your algorithm should do that. Although I'm sure that having 2 or 4 chars per level is no problem :)
tanascius
Just to be clear, I'm talking about using the SHA-1 itself as the path. I need to be able to check if a file is there or not, which means I need the structure to be predictable. The only thing that could dynamically change is the number of digits per level in the hierarchy. I don't see how I could dynamically change the directory structure without moving *hundreds of thousands* of files once it grew beyond a certain size.
McKAMEY
+1  A: 

Add a collision detector and resolver. You had better be ready in case someone tries to check in SHA-1 collision vectors.

I've not seen any SHA-1 collisions yet, but I did see a bad case of an accidental MD5 collision where someone assumed the hashes were unique.

Anyway, NTFS uses B-tree directory structures, so you really could place them all in one folder. Windows Explorer won't like it, though.
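
A minimal sketch of what such a check might look like at store time (the Store signature and the whole-file byte compare are just illustrative choices):

// needs System.IO and System.Linq (for SequenceEqual)
static void Store(string targetPath, byte[] content)
{
    if (File.Exists(targetPath))
    {
        byte[] existing = File.ReadAllBytes(targetPath);
        if (existing.SequenceEqual(content))
        {
            return; // true duplicate: same hash, same bytes
        }

        // same hash, different bytes: a real collision (or a corrupted file)
        throw new InvalidOperationException("SHA-1 collision detected at " + targetPath);
    }

    Directory.CreateDirectory(Path.GetDirectoryName(targetPath));
    File.WriteAllBytes(targetPath, content);
}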

Joshua
A: 

Thanks to the other answerers for their insight.

From other questions around the web, it sounds like NTFS can handle these sizes, but Windows Explorer and network operations will potentially choke at much lower thresholds. I ran a simulation of a very even random distribution, similar to what SHA-1 would produce, for a random set of 1,000,000 "files".

Windows Explorer definitely did not like a first-level width of 4 hex digits, as that quickly approached the maximum (65536) entry count for the level. I tweaked the top two directory lengths to be 3 digits each (4096 max) and put the remaining 34 digits in the third level, to balance depth against the probability of too many directories per level. This seems to let Windows Explorer handle browsing the structure.

Here's my simulation:

// requires System, System.Diagnostics, System.IO, System.Security.Cryptography, System.Text
const string Root = @"C:\_Sha1Buckets";
using (TextWriter writer = File.CreateText(@"C:\_Sha1Buckets.txt"))
{
    // simulate a very even distribution like SHA-1 would produce
    RandomNumberGenerator rand = RandomNumberGenerator.Create();
    byte[] sha1 = new byte[20];
    Stopwatch watch = Stopwatch.StartNew();

    for (int i = 0; i < 1000000; i++)
    {
        // populate bytes with a fake SHA-1
        rand.GetBytes(sha1);

        // format bytes into a 40-char hex string (FormatBytes is a trivial hex-encoding helper, not shown)
        string hash = FormatBytes(sha1);

        // C:\_Sha1Buckets
        StringBuilder builder = new StringBuilder(Root, 60);

        // \012\345\6789abcdef0123456789abcdef01234567\
        builder.Append(Path.DirectorySeparatorChar);
        builder.Append(hash, 0, 3);
        builder.Append(Path.DirectorySeparatorChar);
        builder.Append(hash, 3, 3);
        builder.Append(Path.DirectorySeparatorChar);
        builder.Append(hash, 6, 34);
        builder.Append(Path.DirectorySeparatorChar);

        Directory.CreateDirectory(builder.ToString());

        if (i % 5000 == 0)
        {
            // write out timings every five thousand directories to see if the rate changes
            writer.WriteLine("{0}: {1}", i, watch.Elapsed);
            Console.WriteLine("{0}: {1}", i, watch.Elapsed);
            watch.Reset();
            watch.Start();
        }
    }

    watch.Reset();
    Console.WriteLine("Press Enter to delete the directory structure...");
    Console.ReadLine();
    watch.Start();
    Directory.Delete(Root, true);
    writer.WriteLine("Delete took {0}", watch.Elapsed);
    Console.WriteLine("Delete took {0}", watch.Elapsed);
}

After about fifty thousand directories, the simulation appears to slow down a bit (15-20 seconds per 5000) but then stays at that rate. The delete at the end took over 30 minutes on my machine!

The distributions work out like this for 1 million hashes:

  • 1st level has 4096 folders
  • 2nd level has an average of 250 folders per 1st-level folder
  • 3rd level has an average of 1 folder per 2nd-level folder

That is very manageable within Windows Explorer, and it doesn't seem to get too deep or wide. Obviously, if the distribution weren't this even we could run into problems, but only at the third level; the first two levels are bounded at 4096. I suppose if the target set were larger, we could add an additional level and gain a lot of growth potential. For my application, 1 million is a very reasonable upper bound.
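
Those averages line up with what uniform hashes predict; as a rough check:

1,000,000 / 16^3  ≈  244 hashes per 1st-level folder, nearly all with distinct 3-digit second prefixes (roughly in line with the ~250 observed)
1,000,000 / 16^6  ≈  0.06 hashes per 6-digit prefix, so a 2nd-level folder almost always holds a single entry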

Anyone have any thoughts on the validity of such a test for determining directory structure heuristics?

McKAMEY