I'm writing something that summarizes the files in a file system by hashing a sample of their contents. It constructs a tree of directories and files. Each file entry has the hash of the file contents. For each directory entry, I want to store a hash of the contents of all files in the directory, including those in sub-directories - I'll call this the directory content hash.

The tricky thing about the directory content hash is that I want it to be independent of the structure of the directory, i.e. the hash should be the same if two directories contain the same files but organized into a different sub-directory structure.

The only two methods I can think of are:

  1. Calculate the MD5 of the concatenation of all file content hashes. In order to get the desired hash properties, I would have to list all of the files in the directory, sort them by their hash, concatenate the sorted hashes, and then run MD5 on the concatenation. This seems slower than I would like. I can do the sorting pretty efficiently by using merge sort while calculating directory content hashes throughout a tree, but I can't get around calculating a lot of MD5 hashes on large inputs.

  2. Combine file content hashes using XOR. Each directory would only need to XOR the file content hashes and directory content hashes of its immediate children. This is very fast and simple, but not very collision resistant. It can't even tell the difference between a directory containing one instance of a file and a directory containing three instances of the same file. (Both methods are sketched below.)
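
For reference, here is a minimal Python sketch of the two methods, assuming file_hashes is already the flat list of per-file 16-byte MD5 digests for the whole tree (the function names are illustrative, not part of the original description):

    import hashlib
    from functools import reduce

    def method1(file_hashes):
        # Method #1: sort the content hashes, concatenate, and MD5 the result.
        return hashlib.md5(b"".join(sorted(file_hashes))).digest()

    def method2(file_hashes):
        # Method #2: XOR the 16-byte content hashes together; order independent,
        # but any file appearing an even number of times cancels itself out.
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)),
                      file_hashes, bytes(16))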

It would be nice if there were a function which could be used similarly to the way XOR is used in method #2, but which is more collision resistant. I think method #1 would be fast enough for this specific case, but in the interest of exploring-all-the-options/intellectual-curiosity/future-applications, I'd like to know whether there's a function that satisfies the description in the title (I have a vague memory of wanting a function like that several times in the past).

Thanks.

A: 

Order-independent hashing of a collection of hashes is essentially what you're looking for, no?

It sounds like any order-independent (commutative and associative) operation, like addition or multiplication, will do the trick for you. Addition has the benefit of overflowing in a nice way. I don't recall whether multiplication will work as well.

In short: add all of your values, ignoring the overflow, and you should get something useful. Any other similar function should do the trick if addition isn't sufficiently collision resistant.
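
A minimal sketch of that suggestion, assuming digests is a list of 16-byte MD5 digests (combine_additive is an illustrative name, not an established API):

    def combine_additive(digests):
        # Sum the digests as 128-bit integers, discarding overflow.
        total = 0
        for d in digests:
            total = (total + int.from_bytes(d, "big")) % (1 << 128)
        return total.to_bytes(16, "big")

Unlike XOR, repetition changes the result: three copies of the same file contribute 3*A mod 2^128, which equals A for only two possible values of A.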

Slartibartfast
Whoa, too simple. I tried it with random numbers. Addition is stable and all bits are evenly distributed... Anyone know how collision resistant this is? Thanks.
jthg
Btw, simple multiplication doesn't work. I'm not sure what the effect is called, but the resulting hashes don't come out evenly distributed. Imagine what would happen if any of the inputs is 0.
jthg
Did you use unsigned integers? And yes, zeros are absorbing in multiplication. To avoid absorption, you can replace all zeros by a fixed non-zero number.
Peter G.
No, I used signed integers.
jthg
Is 'too simple' a criticism or a compliment? You should plan for collisions whichever way you go, and if you get too many, use another method. I suspect the collision rate will be comparable with (though not equal to) that of the original hash function. Addition will handle 1 vs. 3 copies of a file for most cases with cryptographic hash functions, since a collision would require (A + A + A) mod (MAX + 1) to equal A. Note that you have to do X-bit math (where X is the size in bits of your hash function's output) to take full advantage of the hash output, so that may mean bignum math.
Slartibartfast
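
To make the 1-vs.-3 argument above concrete: (A + A + A) mod 2^X equals A only when 2A is divisible by 2^X, i.e. only for A = 0 and A = 2^(X-1). A quick, purely illustrative check for X = 128:

    MOD = 1 << 128
    for a in (0, 1, 1 << 127, MOD - 1):
        print(a, (3 * a) % MOD == a)   # True only for 0 and 2**127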
Compliment. I went ahead and accumulated both a sum and an xor and subtracted the two in the end, so the hash is sum(x1..xn) - xor(x1..xn).
jthg
Actually, subtracting the xor is not a good idea; it introduces a couple of collision cases. Just using the sum by itself works pretty well. I tried both sum and md5 on my file system and got equivalent results.
jthg
A: 

Since the count of items is important but the order isn't, just sort the list of hashes and then hash the sorted list:

find . -type f -print0 | xargs -0 sha1sum | cut -c -40 | sort | sha1sum

This gives the type of hash value which is invariant to the directory arrangement. The -type f and -print0/-0 additions keep directories and filenames containing whitespace from breaking the pipeline.

Dan D
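
For comparison, a Python sketch of the same sort-then-hash idea, for use inside a program rather than a shell pipeline (tree_content_hash is an illustrative name; the whole-file reads keep the sketch short):

    import hashlib, os

    def tree_content_hash(root):
        # Hash every file under root, sort the hex digests, and hash the
        # sorted list; the result ignores how files are arranged in
        # sub-directories, mirroring the pipeline above.
        digests = []
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                with open(os.path.join(dirpath, name), "rb") as f:
                    digests.append(hashlib.sha1(f.read()).hexdigest())
        outer = hashlib.sha1()
        for d in sorted(digests):
            outer.update((d + "\n").encode("ascii"))
        return outer.hexdigest()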