views:

310

answers:

1

Hi all. Does anyone know if there's a real benefit regarding decreasing collision probability by combining hash functions? I especially need to know this regarding 32 bit hashing, namely combining Adler32 and CRC32. Basically, will adler32(crc32(data)) yield a smaller collision probability than crc32(data)? The last comment here gives some test results in favor of combining, but no source is mentioned. For my purpose, collision is not critical (i.e. the task does not involve security), but I'd rather minimize the probability anyway, if possible. PS: I'm just starting in the wonderful world of hashing, doing a lot of reading about it. Sorry if I asked a silly question, I haven't even acquired the proper "hash dialect" yet, probably my Google searches regarding this were also poorly formed. Thanks.

+5  A: 

This doesn't make sense combining them in series like that. You are hashing one 32-bit space to another 32-bit space.

In the case of a crc32 collision in the first step, the final result is still a collision. Then you add on any potential collisions in the adler32 step. So it can not get any better, and can only be the same or worse.

To reduce collisions, you might try something like using the two hashes independently to create a 64-bit output space:

adler32(data) << 32 | crc32(data)

Whether there is significant benefit in doing that, I'm not sure.

Note that the original comment you referred to was storing the hashes independently:

Whichever algorithm you use there is going to be some chance of false positives. However, you can reduce these chances by a considerable margin by using two different hashing algorithms. If you were to calculate and store both the CRC32 and the Alder32 for each url, the odds of a simultaneous collision for both hashes for any given pair of urls is vastly reduced.

Of course that means storing twice as much information which is a part of your original problem. However, there is a way of storing both sets of hash data such that it requires minimal memory (10kb or so) whilst giving almost the same lookup performance (15 microsecs/lookup compared to 5 microsecs) as Perl's hashes.

Cade Roux
+1 for the clear explanation that the second hash can't remove a collision in the first hash. :)
Guffa
Webmaster