I'm writing something that summarizes the files in a file system by hashing a sample of their contents. It constructs a tree of directories and files. Each file entry has the hash of the file contents. For each directory entry, I want to store a hash of the contents of all files in the directory, including those in sub-directories - I'll call this the directory content hash.
The tricky thing about the directory content hash is that I want it to be independent of the structure of the directory, i.e. the hash should be the same if two directories contain the same files but organized into a different sub-directory structure.
The only two methods I can think of are:
Calculate the MD5 of the concatenation of all file content hashes. In order to get the desired hash properties, I would have to list all of the files in the directory, sort them by their hash, concatenate the sorted hashes, and then run MD5 on the concatenation. This seems slower than I would like. I can do the sorting pretty efficiently by using merge sort while calculating directory content hashes throughout a tree, but I can't get around calculating a lot of MD5 hashes on large inputs.
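To make sure I'm describing method #1 clearly, here's a minimal Python sketch of it. It assumes each file's content hash is already available as a raw bytes digest; the function name and signature are just for illustration:

```python
import hashlib

def directory_content_hash(file_hashes):
    """Structure-independent directory hash (method #1): sort the
    per-file content digests, concatenate them, and MD5 the result.

    `file_hashes` is an iterable of bytes digests, one per file under
    the directory, including files in all sub-directories.
    """
    combined = b"".join(sorted(file_hashes))
    return hashlib.md5(combined).digest()
```

Because the digests are sorted before concatenation, two trees with the same set of files produce the same directory content hash regardless of how the files are arranged into sub-directories.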
Combine file content hashes using XOR. Each directory would only need to XOR the file content hashes and directory content hashes of its immediate children. This is very fast and simple, but not very collision resistant. Since XOR is self-inverse, it can't even tell the difference between a directory which contains one instance of a file and a directory which contains three instances of the same file (x XOR x XOR x = x).
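For comparison, here's what I mean by method #2, again as a rough Python sketch with an illustrative function name. It also demonstrates the collision I'm worried about:

```python
def xor_combine(digests):
    """Combine equal-length digests with bytewise XOR (method #2).
    Order-independent and cheap to update incrementally, but any digest
    that appears an even number of times cancels itself out entirely."""
    acc = bytearray(len(digests[0]))
    for d in digests:
        for i, b in enumerate(d):
            acc[i] ^= b
    return bytes(acc)
```

The problem: `xor_combine([h, h, h])` equals `xor_combine([h])`, so one copy of a file and three copies of the same file yield the same directory content hash.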
It would be nice if there were a function which can be used similarly to the way XOR is used in method #2, but is more collision resistant. I think method #1 would be fast enough for this specific case, but in the interest of exploring all the options/intellectual curiosity/future applications, I'd like to know whether there's a function that satisfies the description in the title (I have a vague memory of wanting a function like that several times in the past).
Thanks.