views: 78

answers: 2

I'm going to be tracking different versions of potentially millions of different files, and my intent is to hash them to determine whether I've already seen that particular version of the file. Currently, I'm only using MD5 (the product is still in development, so it hasn't dealt with millions of files yet), which is clearly not long enough to avoid collisions.

However, here's my question - Am I more likely to avoid collisions if I hash the file using two different methods and store both hashes (say, SHA1 and MD5), or if I pick a single, longer hash (like SHA256) and rely on that alone? I know option 1 has 288 hash bits and option 2 has only 256, but assume my two choices are the same total hash length.

Since I'm dealing with potentially millions of files (and multiple versions of those files over time), I'd like to do what I can to avoid collisions. However, CPU time isn't (completely) free, so I'm interested in how the community feels about the tradeoff - is adding more bits to my hash proportionally more expensive to compute, and are there any advantages to multiple different hashes as opposed to a single, longer hash, given an equal number of bits in both solutions?
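
For context, here is a rough way to compare the CPU cost of the two options on a given machine. This is a minimal sketch using Python's hashlib; the buffer size and repeat count are arbitrary and only meant to give a relative feel:

    import hashlib
    import os
    import time

    data = os.urandom(64 * 1024 * 1024)  # 64 MB of random bytes (arbitrary size)

    def time_it(label, fn, repeats=5):
        # Average wall-clock time over a few passes; crude, but good enough
        # for a relative comparison of the two hashing schemes.
        start = time.perf_counter()
        for _ in range(repeats):
            fn()
        elapsed = (time.perf_counter() - start) / repeats
        print(f"{label}: {elapsed:.3f} s per pass")

    time_it("MD5 + SHA1", lambda: (hashlib.md5(data).digest(),
                                   hashlib.sha1(data).digest()))
    time_it("SHA-256   ", lambda: hashlib.sha256(data).digest())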

+2  A: 

I've thought about and toyed with this problem a great deal, and I recommend using SHA256 to stay on the safe side (it's slower, but the CPU should still manage to keep up). I don't know whether it significantly weakens the hash strength, but you may want to split the file into 16MB blocks (for example), hash each block, then hash the hashes at the end so you can parallelize.
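
A minimal sketch of that block-wise idea, assuming Python's hashlib and 16MB blocks (the block size and the two-level scheme are illustrative, not a standard construction):

    import hashlib

    BLOCK_SIZE = 16 * 1024 * 1024  # 16 MB blocks, as suggested above

    def two_level_sha256(path):
        """Hash each 16 MB block, then hash the concatenation of block digests.

        The per-block hashes are independent of each other, so they could be
        computed in parallel and combined at the end.
        """
        top = hashlib.sha256()
        with open(path, "rb") as f:
            while True:
                block = f.read(BLOCK_SIZE)
                if not block:
                    break
                top.update(hashlib.sha256(block).digest())
        return top.hexdigest()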

One lesson I've learned from toying with massive numbers of files and hashing is this: adding millions of records to a PostgreSQL database in one sitting isn't very fast. When I wrote a program to hash one million files and store them in a PostgreSQL database, the database was often the bottleneck. I didn't try MySQL, but I suspect it would be roughly the same. SQLite is probably much faster since there's no client/server overhead, so I recommend trying SQLite first, though it might turn out to be too slow as well.
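
If you do try SQLite, one thing that helps a lot is batching the inserts into a single transaction rather than committing one row at a time. A rough sketch (the database file, table, and column names are made up for illustration):

    import sqlite3

    conn = sqlite3.connect("hashes.db")  # hypothetical database file name
    conn.execute("""
        CREATE TABLE IF NOT EXISTS file_hashes (
            sha256 TEXT    NOT NULL,
            size   INTEGER NOT NULL,
            path   TEXT    NOT NULL,
            PRIMARY KEY (sha256, size)
        )
    """)

    def store_batch(rows):
        """Insert many (sha256_hex, size, path) tuples in one transaction."""
        with conn:  # one transaction for the whole batch, not one per row
            conn.executemany(
                "INSERT OR IGNORE INTO file_hashes (sha256, size, path) "
                "VALUES (?, ?, ?)",
                rows,
            )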

Also, if you store a million files named by their hashes in a directory and lose the index file, it's kind of hard to find things :)

Joey Adams
Fair enough on the index - I'll have to keep that pretty safe, since it would be agony to rebuild it. However, I'm using the file's hash as a key and also storing the file size, so a collision would have to match both of those items, which seems far less likely than a hash collision alone. Maybe with those two pieces of information, I'm safe.
rwmnau
Optimizing to avoid a collision isn't buying you anything that I can see.
mrjoltcola
+1  A: 

For file version tracking, I would think collisions between different files are not a concern. For each file, you are using a hash to determine whether that file, and only that file, has changed. Whether the hash for that file collides with the hash of a different file is irrelevant, isn't it?

EDIT: You are applying a hash as an optimization to avoid comparing each new file to millions of existing files. Collisions are not a reason to avoid using a fast hash. Simply deal with the collision case (if it ever happens) by storing the new version of the file anyway. Both hashing schemes provide the optimization, so why over-optimize for something that probably won't happen? What if you had a super-fast hash that collided 1 in 1,000,000 times? That would not be good for cryptography, but it would be fine for version control.
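
To make "deal with the collision case" concrete, here is one possible shape for the lookup (the in-memory index and the function name are hypothetical):

    import filecmp

    # Hypothetical in-memory index: (hash_hex, size) -> path of the stored copy.
    index = {}

    def already_stored(path, digest, size):
        """Return True only if a byte-identical file is already stored.

        On a hash+size match, fall back to a byte-for-byte comparison, so even
        an improbable collision can't cause a distinct file to be silently
        treated as a duplicate.
        """
        existing = index.get((digest, size))
        if existing is None:
            return False
        return filecmp.cmp(existing, path, shallow=False)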

Even when using GUIDs, systems detect collisions and handle them. The system need not be optimized for something that will statistically never happen.

mrjoltcola
If two files have the same hash but are different, the file tracker won't know they are different and may end up deleting one of them, resulting in data loss (however minute).
Joey Adams
If that is a concern at all, then a hash function is inappropriate.
mrjoltcola
I am concerned about collisions between different files as well - regardless of what a file is named, I want to determine whether it has already been backed up by somebody else on the network. I guess my question is really about the best way to minimize collision risk.
rwmnau
@mrjoltcola: If the likelihood of any two SHA256 hashes in the world colliding is 1 in ${astronomical number}, you might as well say it will never happen, practically speaking. There is no known SHA256 collision. If one is found in the future, better cryptographic hashing algorithms will be available to replace it.
Joey Adams
@Joey Adams: Does this mean that, if I were to go with SHA256 as my hashing algorithm (I'm also using file size as a key to prevent collisions), I can safely proceed on the assumption that I'll avoid all collisions? The numbers seem to support "don't worry about it", but I'd like to hear that from somebody else.
rwmnau
Yes, this assumption can be considered valid for a long time, at least until someone cracks the SHA256 algorithm or computers get dramatically faster. Even then, you can cross that bridge when you get to it, e.g. by switching to a new hash function and rehashing all the files, if that's even necessary. So yes, "don't worry about it".
Joey Adams
@Joey Adams: I agree, but my point is that hashes can be used to achieve "good enough" detection of a previous file. In the 1-in-${astronomical number} case where you do find a collision, you deal with it by comparing the file contents and/or storing a unique incarnation. The hashing isn't required to solve every challenge in the system; it is just an optimization to avoid comparing every new file to millions of existing files in the versioning system.
mrjoltcola
@mrjoltcola: Bear in mind a couple of alternative use cases: 1) hash all files, then hash them again at a later date to make sure they are intact, and 2) scan files on disk A, remove disk A, insert disk B, then determine whether there are duplicates across the two disks.
Joey Adams
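
A sketch of use case 1 above (re-hashing later to verify files are intact), assuming a manifest that maps each path to its stored SHA-256 hex digest; the function names are illustrative:

    import hashlib

    def sha256_of(path, chunk_size=1 << 20):
        """Stream a file through SHA-256 in 1 MB chunks."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    def find_corrupted(manifest):
        """Yield paths whose current hash no longer matches the stored digest."""
        for path, expected in manifest.items():
            if sha256_of(path) != expected:
                yield path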