tags:
views: 412
answers: 3

I have an app that stores file-based data under an NTFS directory path keyed off the SHA-1 hash of the data. It has several really nice attributes (de-duplication, imperviousness to other metadata changes, etc.), but I'm curious about the best practices people have found for creating hash-based directory storage structures. My primary concern is the number of files/folders that can realistically be stored at a given folder depth.

Does anyone know what sorts of limitations I'll run into? If I were to dump them all into folders at the root of the storage path, I feel I would severely limit the ability of the storage to grow. While it won't be a problem soon, I'd rather have a structure that avoids this than have to restructure massive storage later.

If I took an approach that chunks up the signature to create a deeper tree, is there any guidance on how much I would need to chunk it? Would something like this suffice?

StringBuilder foo = new StringBuilder(60);
// ...root, etc.
// a SHA-1 hex string is always 40 chars; chunk it up to distribute into smaller groups
// e.g. "\0000\0000000000000000\00000000000000000000"
foo.Append(Path.DirectorySeparatorChar);
foo.Append(sha1, 0, 4);
foo.Append(Path.DirectorySeparatorChar);
foo.Append(sha1, 4, 16);
foo.Append(Path.DirectorySeparatorChar);
foo.Append(sha1, 20, 20);

Knowing that SHA-1 output is pretty evenly distributed, I have to assume the files would spread evenly on average, but that sooner or later some clusters would form. It's those clusters I'm concerned about.
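
As a rough back-of-the-envelope (treating the hashes as uniformly random), a 4-character first level gives:

1,000,000 files / 16^4 = 65,536 buckets  ≈  15 files per bucket on average

so in theory even an unlucky bucket stays in the tens of files rather than the thousands.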

Are there performance penalties when accessing directory structures that are too wide? I know that Windows Explorer will choke, but what about programmatic access via C# / System.IO?
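
For concreteness, the access pattern is nothing fancier than building the path from the hash and going straight through System.IO, roughly like this (BuildPath standing in for the chunking above):

// BuildPath is a placeholder for the chunking logic shown earlier
string path = BuildPath(storageRoot, sha1);

if (!File.Exists(path))
{
    Directory.CreateDirectory(Path.GetDirectoryName(path));
    File.WriteAllBytes(path, data);
}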

+3  A: 

Some observations:

  • You split after 4 chars and then after 16 more. 4 chars on their own can lead to 65536 entries in a directory, and 16 chars can lead to 16^16 entries, which is certainly way too much (and there are still 20 more characters remaining ...)
  • So the next question is: how did you choose these numbers? They look like magic numbers to me. You seem to be hoping that your splits will do the work in all cases ...

Your question about the directory depth that can be handled is a good one - and I can't answer it. But you should check whether 20 nested directories are too much to handle, because 20 levels would let you keep a maximum of 256 entries per level:

xx/xx/xx/xx/xx/...

On the other hand, you could stick with your 4 characters, which would lead to a depth of 10 and a maximum of 65536 entries per level:

xxxx/xxxx/xxxx/xxxx/xxxx/...

However - in both cases I'd probably write a dynamic algorithm, which checks the number of items per level and introduces new subfolders as you need them. So the first 256 (or 65536) items would just go into one directory.
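
A minimal sketch of that kind of on-demand split, assuming two hex chars per level, files named with their full 40-character hash, and an arbitrary 4096-file threshold:

const int MaxFilesPerDir = 4096; // assumed threshold, tune to taste

// Returns the full path at which a file with the given 40-char hash should be stored,
// splitting the target directory into two-char buckets once it gets too full.
// A pure lookup only needs the descend loop plus a File.Exists check.
static string GetStorePath(string root, string sha1Hex)
{
    string dir = root;
    int offset = 0;

    // descend while a bucket for the next two hash chars already exists
    while (offset < sha1Hex.Length &&
           Directory.Exists(Path.Combine(dir, sha1Hex.Substring(offset, 2))))
    {
        dir = Path.Combine(dir, sha1Hex.Substring(offset, 2));
        offset += 2;
    }

    // split this directory once it holds too many files
    if (Directory.GetFiles(dir).Length >= MaxFilesPerDir)
    {
        foreach (string file in Directory.GetFiles(dir))
        {
            // every file here shares dir's relative path as its hash prefix,
            // so the next two chars of its name pick the new bucket
            string name = Path.GetFileName(file);
            string bucket = Path.Combine(dir, name.Substring(offset, 2));
            Directory.CreateDirectory(bucket);
            File.Move(file, Path.Combine(bucket, name));
        }
        dir = Path.Combine(dir, sha1Hex.Substring(offset, 2));
        Directory.CreateDirectory(dir);
    }

    return Path.Combine(dir, sha1Hex);
}

A file always sits at the deepest existing bucket along its own hash prefix, so lookups stay deterministic; the cost is that a split has to move the files in the one directory being split.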

tanascius
Yeah, they were fairly magic numbers. It was more to show that I could pick *some* subset of the 40 chars per level. I figured there would be more entries in total toward the root, so the number of digits per level could grow with depth. That, plus I read somewhere about tens of thousands being a limitation (I don't recall the specifics), hence the magic 4+16+20. I figured I wouldn't be able to saturate any one directory, but didn't want chance clustering to get too out of hand. Total guesswork, hence the question about a better heuristic.
McKAMEY
OK, my point is: don't guess - use a dynamic algorithm which will introduce new levels of hierarchy when necessary. But I can't give you a number for when your algorithm should do that. Although I'm sure that having 2 or 4 chars per level is no problem :)
tanascius
Just to be clear, I'm talking about using the SHA-1 itself as the path. I need to be able to check if a file is there or not, which means I need the structure to be predictable. The only thing that could dynamically change is the number of digits per level in the hierarchy. I don't see how I could dynamically change the directory structure without moving *hundreds of thousands* of files once it grew beyond a certain size.
McKAMEY
+1  A: 

Add a collision detector and resolver. You had better be ready in case someone tries to check in SHA-1 collision vectors.

I've not seen any SHA-1 collisions yet, but I did see a bad case of an accidental MD5 collision where someone assumed the hashes were unique.

Anyway, NTFS uses B-tree directory structures, so you really could place them all in one folder. Windows Explorer won't like it, though.
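
A minimal sketch of what such a check might look like at store time (the Store signature and the whole-file byte compare are just illustrative choices):

// needs System.IO and System.Linq (for SequenceEqual)
static void Store(string targetPath, byte[] content)
{
    if (File.Exists(targetPath))
    {
        byte[] existing = File.ReadAllBytes(targetPath);
        if (existing.SequenceEqual(content))
        {
            return; // true duplicate: same hash, same bytes
        }

        // same hash, different bytes: a real collision (or a corrupted file)
        throw new InvalidOperationException("SHA-1 collision detected at " + targetPath);
    }

    Directory.CreateDirectory(Path.GetDirectoryName(targetPath));
    File.WriteAllBytes(targetPath, content);
}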

Joshua
A: 

Thanks to the other answerers for their insight.

From other questions around the web, it sounds like NTFS can handle these sizes, but Windows Explorer and network operations will potentially choke at much lower thresholds. I ran a simulation of a very even random distribution, similar to what SHA-1 would produce, for a random set of 1,000,000 "files".

Windows Explorer definitely did not like a first-level width of 4 hex digits, as that quickly approached the maximum (65536) entry count for the level. I tweaked the top two directory lengths to be 3 digits each (4096 max) and put the remaining 34 digits in the third level, to balance depth against the probability of too many directories per level. This seems to let Windows Explorer handle browsing the structure.

Here's my simulation:

// requires System, System.Diagnostics, System.IO, System.Security.Cryptography, System.Text
const string Root = @"C:\_Sha1Buckets";
using (TextWriter writer = File.CreateText(@"C:\_Sha1Buckets.txt"))
{
    // simulate a very even distribution like SHA-1 would produce
    RandomNumberGenerator rand = RandomNumberGenerator.Create();
    byte[] sha1 = new byte[20];
    Stopwatch watch = Stopwatch.StartNew();

    for (int i = 0; i < 1000000; i++)
    {
        // populate bytes with a fake SHA-1
        rand.GetBytes(sha1);

        // format bytes into a 40-char hex string (FormatBytes is a trivial hex-encoding helper, not shown)
        string hash = FormatBytes(sha1);

        // C:\_Sha1Buckets
        StringBuilder builder = new StringBuilder(Root, 60);

        // \012\345\6789abcdef0123456789abcdef01234567\
        builder.Append(Path.DirectorySeparatorChar);
        builder.Append(hash, 0, 3);
        builder.Append(Path.DirectorySeparatorChar);
        builder.Append(hash, 3, 3);
        builder.Append(Path.DirectorySeparatorChar);
        builder.Append(hash, 6, 34);
        builder.Append(Path.DirectorySeparatorChar);

        Directory.CreateDirectory(builder.ToString());

        if (i % 5000 == 0)
        {
            // write out timings every five thousand directories to see if the rate changes
            writer.WriteLine("{0}: {1}", i, watch.Elapsed);
            Console.WriteLine("{0}: {1}", i, watch.Elapsed);
            watch.Reset();
            watch.Start();
        }
    }

    watch.Reset();
    Console.WriteLine("Press Enter to delete the directory structure...");
    Console.ReadLine();
    watch.Start();
    Directory.Delete(Root, true);
    writer.WriteLine("Delete took {0}", watch.Elapsed);
    Console.WriteLine("Delete took {0}", watch.Elapsed);
}

After about fifty thousand directories, the simulation appears to slow down a bit (15-20 seconds per 5000) but then stays at that rate. The delete at the end took over 30 minutes on my machine!

The distributions work out like this for 1 million hashes:

  • 1st level has 4096 folders
  • 2nd level has an average of 250 folders per 1st-level folder
  • 3rd level has an average of 1 folder per 2nd-level folder

That is very manageable within Windows Explorer, and it doesn't seem to get too deep or wide. Obviously, if the distribution weren't this even we could run into problems, but only at the third level; the first two levels are bounded at 4096. I suppose if the target set were larger, we could add an additional level and gain a lot of growth potential. For my application, 1 million is a very reasonable upper bound.
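
Those averages line up with what uniform hashes predict; as a rough check:

1,000,000 / 16^3  ≈  244 hashes per 1st-level folder, nearly all with distinct 3-digit second prefixes (roughly in line with the ~250 observed)
1,000,000 / 16^6  ≈  0.06 hashes per 6-digit prefix, so a 2nd-level folder almost always holds a single entry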

Anyone have any thoughts on the validity of such a test for determining directory structure heuristics?

McKAMEY