views: 570
answers: 7

What is the fastest way to create a hash function which will be used to check if two files are equal?

Security is not very important.

Edit: I am sending a file over a network connection, and want to be sure that the files on both sides are equal.

+5  A: 

One approach might be to use a simple CRC-32 algorithm, and only if the CRC values compare equal, rerun the hash with a SHA1 or something more robust. A fast CRC-32 will outperform a cryptographically secure hash any day.
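A minimal sketch of this two-stage check in Python, using `zlib.crc32` for the cheap pass and `hashlib.sha1` for confirmation (the function names and the 64 KB chunk size are my own choices, not from the answer):

```python
import hashlib
import zlib

def crc32_of_file(path, chunk_size=65536):
    """Compute CRC-32 of a file incrementally, without loading it whole."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF

def sha1_of_file(path, chunk_size=65536):
    """Compute SHA-1 of a file incrementally."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def probably_equal(path_a, path_b):
    """Cheap CRC-32 first; run the slower SHA-1 only if the CRCs match."""
    if crc32_of_file(path_a) != crc32_of_file(path_b):
        return False
    return sha1_of_file(path_a) == sha1_of_file(path_b)
```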

Greg Hewgill
I'd say that hashing a file is likely to be I/O-bound anyhow, so you might as well use a hash with good distribution and a large range (certainly any crypto hash qualifies).
Steven Sudit
+4  A: 

Why do you want to hash it?

If you want to make sure that two files are equal, then by definition you will have to read the entire file (unless they are literally the same file, in which case you can tell by looking at the metadata on the file system). Anyway, there is no reason to hash; just read over them and see if they are the same. Hashing will make it less efficient. And even if the hashes match, you still aren't sure whether the files really are equal.

tster
I am sending a file over a network connection, and want to be sure that the files on both sides are equal.
eflles
Oh, well in that case just use a real hash algorithm. I guarantee your network will be slower than the hash.
Greg Hewgill
In such a case, use an already existing hash function. Greg, posted some good examples.
tster
+1  A: 

If it's only a one-off, then given that you'll have to read both files to generate a hash of each, why not just read a small amount of each at a time and compare them as you go?

Failing that, CRC is a very simple algorithm.
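The chunk-at-a-time comparison suggested above can be sketched like this (function name and chunk size are mine; it bails out at the first differing chunk, so mismatched files are often detected without reading either one fully):

```python
import os

def files_equal(path_a, path_b, chunk_size=65536):
    """Compare two local files chunk by chunk, stopping at the first difference."""
    # Fast fail: different sizes means different files, no reads needed.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            ca = fa.read(chunk_size)
            cb = fb.read(chunk_size)
            if ca != cb:
                return False
            if not ca:  # both files exhausted simultaneously
                return True
```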

Jeff Foster
+1  A: 

You could try MurmurHash, which was specifically designed to be fast, and is pretty simple to code. You might want to add a second, more secure hash if MurmurHash returns a match though, just to be sure.
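In practice MurmurHash is usually called through a native binding (for Python, the third-party `mmh3` package). As an illustration of how simple the algorithm is to code, here is a pure-Python sketch of the 32-bit MurmurHash3 variant; note a pure-Python loop is far slower than a C implementation, so this is for study rather than throughput:

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """Pure-Python MurmurHash3, x86 32-bit variant."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & 0xFFFFFFFF
    length = len(data)
    nblocks = length // 4
    # Body: process the input four bytes at a time.
    for i in range(nblocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF  # rotate left 15
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xFFFFFFFF  # rotate left 13
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    # Tail: fold in the remaining 1-3 bytes, if any.
    tail = data[nblocks * 4:]
    k = 0
    if len(tail) >= 3:
        k ^= tail[2] << 16
    if len(tail) >= 2:
        k ^= tail[1] << 8
    if len(tail) >= 1:
        k ^= tail[0]
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
    # Finalization: mix the length in and avalanche the bits.
    h ^= length
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h
```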

int3
The OP stated that security was not a consideration here, so I'm not sure why a second hash would help. Instead, I'd suggest using one of the 64-bit variants of Murmur.
Steven Sudit
A: 

For this type of application, Adler32 is probably the fastest algorithm, with a reasonable level of security. For bigger files, you may calculate multiple hash values, for example one per 5 MB block of the file, hence decreasing the chance of errors (i.e. of cases when the hashes are the same yet the file contents differ). Furthermore, this multi-hash setup may allow the calculation of the hash to be implemented in a multi-threaded fashion.

Edit: (following Steven Sudit's remark)
A word of caution if the files are small!
Adler32's "cryptographic" properties, or rather its weaknesses, are well known, particularly for short messages. For this reason the proposed solution should be avoided for files smaller than a few kilobytes.
Nevertheless, in the question the OP explicitly seeks a fast algorithm and waives concerns about security. Furthermore, the quest for speed may plausibly imply that one is dealing with "big" files rather than small ones. In this context, Adler32, possibly applied in parallel to file chunks of, say, 5 MB, remains a very valid answer. Adler32 is reputed for its simplicity and speed. Also, its reliability, while lower than that of CRCs of the same length, is quite acceptable for messages over 4000 bytes.
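The per-block Adler32 scheme described above might look like this in Python, using `zlib.adler32` (the function name is mine; the 5 MB block size follows the answer and could be tuned):

```python
import zlib

BLOCK_SIZE = 5 * 1024 * 1024  # one checksum per 5 MB block, per the answer

def adler32_blocks(path, block_size=BLOCK_SIZE):
    """Return a list of Adler-32 values, one per block of the file.

    Comparing the two lists element-wise localizes a difference to a
    block, and each block could be checksummed in a separate thread.
    """
    sums = []
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            sums.append(zlib.adler32(block) & 0xFFFFFFFF)
    return sums
```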

mjv
I would not recommend Adler32 for any purpose. It has terrible characteristics, particularly for short files.
Steven Sudit
+4  A: 

Unless you're using a really complicated and/or slow hash, loading the data from the disk is going to take much longer than computing the hash.

So to compare two files, use this algorithm:

  • Compare sizes
  • Compare dates (be careful here: this can give you the wrong answer; you must test whether this is the case for you or not)
  • Compare the hashes

This allows for a fast fail (if the sizes are different, you know that the files are different).

To make things even faster, you can compute the hash once and save it alongside the file. Also save the file's date and size in this extra file, so you can quickly tell when the hash must be recomputed, or the hash file deleted, because the main file has changed.
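A sketch of that hash-caching idea in Python, storing the digest with the file's size and mtime in a sidecar file (the sidecar naming, JSON format, and choice of SHA-256 are all my assumptions, not from the answer):

```python
import hashlib
import json
import os

def file_sha256(path, chunk_size=65536):
    """Full-content hash, computed incrementally."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def cached_hash(path):
    """Return the file's hash, reusing a '<path>.hash' sidecar file
    whenever the stored size and mtime still match the file."""
    sidecar = path + ".hash"
    stat = os.stat(path)
    if os.path.exists(sidecar):
        with open(sidecar) as f:
            cached = json.load(f)
        if cached["size"] == stat.st_size and cached["mtime"] == stat.st_mtime:
            return cached["sha256"]  # still valid: no re-read of the data
    digest = file_sha256(path)
    with open(sidecar, "w") as f:
        json.dump({"size": stat.st_size,
                   "mtime": stat.st_mtime,
                   "sha256": digest}, f)
    return digest
```

Note the caveat from the answer applies here too: mtime can be unreliable, so a stale-cache check based on it is a heuristic, not a guarantee.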

Aaron Digulla
A: 

You might check out the algorithm that the Samba/rsync developers use. I haven't looked at it in depth, but I see it mentioned all the time. Apparently it's quite good.
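For context: rsync pairs a strong hash with a weak *rolling* checksum (two 16-bit running sums, in the Adler-32 family) that can slide a window one byte in O(1) instead of re-summing the window. The sketch below illustrates that rolling trick; it is a simplified illustration, not rsync's exact code:

```python
def weak_checksum(window):
    """rsync-style weak checksum of a byte window: two 16-bit running sums.

    a is the sum of the bytes; b weights each byte by its distance
    from the end of the window.
    """
    n = len(window)
    a = sum(window) & 0xFFFF
    b = sum((n - i) * x for i, x in enumerate(window)) & 0xFFFF
    return a, b

def roll(a, b, old_byte, new_byte, n):
    """Slide the window one byte to the right: O(1) update.

    old_byte leaves the window on the left, new_byte enters on the right.
    """
    a = (a - old_byte + new_byte) & 0xFFFF
    b = (b - n * old_byte + a) & 0xFFFF
    return a, b
```

The rolling property is what lets rsync cheaply test every byte offset of a file against a table of known block checksums, falling back to the strong hash only on a weak match.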

bostonBob