views:

147

answers:

1

Say I'm using a hash to identify files, so I don't need it to be secure, I just need to minimize collisions. I was thinking that I could speed the hash up by running four hashes in parallel using SIMD and then hashing the final result. If the hash is designed to take a 512-bit block, I just step through the file taking 4x512 bit blocks at one go and generate four hashes out of that; then at the end of the file I hash the four resulting hashes together.

I'm pretty sure that this method would produce poorer hashes... but how much poorer? Any back of the envelope calculations?

+4  A: 

The idea that you can read blocks of the file from disk quicker than you can hash them is, well, an untested assumption? Disk IO - even SSD - is many orders of magnitude slower than the RAM that the hashing is going though.

Ensuring low collisions is a design criteria for all hashes, and all mainstream hashes do a good job of it - just use a mainstream hash e.g. MD5.

Specific to the solution the poster is considering, its not a given that parallel hashing weakens the hash. There are hashes specifically designed for parallel hashing of blocks and combining the results as the poster said, although perhaps not yet in widespread adoption (e.g. MD6, which withdrew unbroken from SHA3)

More generally, there are mainstream implementations of hashing functions that do use SIMD. Hashing implementers are very performance-aware, and do take time to optimise their implementations; you'd have a hard job equaling their effort. The best software for strong hashing is around 6 to 10 cycles / byte. Hardware accelerated hashing is also available if hashing is the real bottleneck.

Will
If you're worried about collisions I'd bypass MD5 and use something stronger like SHA1, SHA-256 or Whirlpool.
Schwern