views: 1741
answers: 6

Hello everyone,

I'm working on a program that searches entire drives for a given file. At the moment, I calculate an MD5 hash for the known file and then scan all files recursively, looking for a match.

The only problem is that MD5 is painfully slow on large files. Is there a faster alternative that I can use while retaining a very small probability of false positives?

All code is in C#.

Thank you.

Update

I've read that even MD5 can be pretty quick and that disk I/O should be the limiting factor. That leads me to believe that my code might not be optimal. Are there any problems with this approach?

        MD5 md5 = MD5.Create();
        StringBuilder sb = new StringBuilder();
        try
        {
            using (FileStream fs = File.Open(fileName, FileMode.Open, FileAccess.Read))
            {
                foreach (byte b in md5.ComputeHash(fs))
                    sb.Append(b.ToString("x2").ToLower());
            }
            return sb.ToString().ToUpper();
        }
        catch (Exception)
        {
            return "";
        }
+2  A: 

I recommend you take a tiered approach. Use a fast but collision-prone hash (like Super Fast Hash or Murmur2) as a "rough estimate". This will give you a small list of files which may be the same; you can then MD5 all of the files on that list to obtain the "correct" answer.

Using a fast hash for the first round will probably make the whole thing faster, but because some files end up being hashed twice it could also turn out slower. It is certainly worth a try, though.
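Something along these lines, as a rough sketch; FNV-1a is just standing in for whichever fast hash you pick, and the names and buffer size are purely illustrative:

    using System;
    using System.IO;
    using System.Linq;
    using System.Security.Cryptography;

    static class TieredMatcher
    {
        // First pass: a cheap non-cryptographic hash (FNV-1a here, purely as an example).
        public static ulong FastHash(string path)
        {
            ulong hash = 14695981039346656037UL;               // FNV-1a 64-bit offset basis
            var buffer = new byte[64 * 1024];
            using (var fs = File.OpenRead(path))
            {
                int read;
                while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
                    for (int i = 0; i < read; i++)
                        hash = (hash ^ buffer[i]) * 1099511628211UL;   // FNV-1a 64-bit prime
            }
            return hash;
        }

        public static byte[] Md5Of(string path)
        {
            using (var md5 = MD5.Create())
            using (var fs = File.OpenRead(path))
                return md5.ComputeHash(fs);
        }

        // knownFastHash and knownMd5 are computed once, up front, for the file you're searching for.
        public static bool Matches(ulong knownFastHash, byte[] knownMd5, string candidate)
        {
            if (FastHash(candidate) != knownFastHash)
                return false;                                  // cheap rejection handles most files
            return Md5Of(candidate).SequenceEqual(knownMd5);   // expensive confirmation for the few survivors
        }
    }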

SoapBox
I don't think the problem is the speed of the hash function; it's the time it takes to read the data from the drive.
Sam Saffron
+4  A: 

Why not just read the file linearly? It seems pretty pointless to read the entire file, compute an MD5 hash, and then compare the hashes.

Reading the file sequentially, a few bytes at a time, would let you discard the vast majority of files after reading, say, 4 bytes. And you'd save all the processing overhead of computing a hash, which doesn't buy you anything in your case.
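To illustrate (a rough, untested sketch; the 64 KB buffer size is an arbitrary choice):

    using System.IO;

    static class DirectCompare
    {
        // Compare two files byte-for-byte, bailing out at the first difference.
        public static bool SameContents(string pathA, string pathB)
        {
            if (new FileInfo(pathA).Length != new FileInfo(pathB).Length)
                return false;                          // different sizes can never match

            using (var a = File.OpenRead(pathA))
            using (var b = File.OpenRead(pathB))
            {
                var bufA = new byte[64 * 1024];
                var bufB = new byte[64 * 1024];
                int readA;
                while ((readA = ReadFully(a, bufA)) > 0)
                {
                    if (ReadFully(b, bufB) != readA)
                        return false;
                    for (int i = 0; i < readA; i++)
                        if (bufA[i] != bufB[i])
                            return false;              // mismatch: stop reading immediately
                }
                return true;
            }
        }

        // Stream.Read may return fewer bytes than requested, so loop until the buffer is full or EOF.
        private static int ReadFully(Stream s, byte[] buffer)
        {
            int total = 0, read;
            while (total < buffer.Length && (read = s.Read(buffer, total, buffer.Length - total)) > 0)
                total += read;
            return total;
        }
    }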

If you already had the hashes for all the files in the drive, it'd make sense to compare them, but if you have to compute them on the fly, there just doesn't seem to be any advantage to the hashing.

Am I missing something here? What does hashing buy you in this case?

jalf
Unfortunately I won't have access to the original file when the program is running, so storing a hash (actually many hashes) is the only way I can compare.
paulbeesley
At the very least, if you can store the hash plus the first few bytes (preferably more than 4, since that's often the size of file-format magic numbers), then you can discard the vast majority of cases after only opening the file and reading a few bytes.
Steve Jessop
+3  A: 

First consider what your bottleneck really is: the hash function itself, or the disk access speed? If you are bound by the disk, changing the hashing algorithm won't gain you much. From your description I gather that you are always scanning the whole disk to find a match; consider building an index first and then only matching a given hash against that index. That will be much faster.
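A minimal in-memory version of such an index might look something like this (error handling for unreadable files and persisting the index to disk are left out):

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Security.Cryptography;

    static class HashIndex
    {
        // Build the index once: MD5 hex string -> all paths with that hash.
        public static Dictionary<string, List<string>> Build(string root)
        {
            var index = new Dictionary<string, List<string>>();
            using (var md5 = MD5.Create())
            {
                foreach (var path in Directory.GetFiles(root, "*", SearchOption.AllDirectories))
                {
                    string key;
                    using (var fs = File.OpenRead(path))
                        key = BitConverter.ToString(md5.ComputeHash(fs));

                    List<string> paths;
                    if (!index.TryGetValue(key, out paths))
                        index[key] = paths = new List<string>();
                    paths.Add(path);
                }
            }
            return index;
        }
    }

Each subsequent search is then just a dictionary lookup on the known file's hash instead of another full scan of the drive.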

Adam Byrtek
+16  A: 

I hope you're checking for an MD5 match only if the file size already matches.

Another optimization is to do a quick checksum of the first 1K (or some other arbitrary but reasonably small number of bytes) and make sure those match before processing the whole file.

Of course, all this assumes that you're just looking for a match/nomatch decision for a particular file.
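A sketch of how those checks might be staged, cheapest first (the 1 KB figure is as arbitrary here as in the text above):

    using System.IO;
    using System.Linq;
    using System.Security.Cryptography;

    static class StagedCheck
    {
        public static bool ProbablySame(string knownFile, string candidate)
        {
            // 1. Cheapest test first: the lengths must agree.
            if (new FileInfo(knownFile).Length != new FileInfo(candidate).Length)
                return false;

            // 2. Next: the first 1 KB must agree (arbitrary but reasonably small).
            if (!FirstBytes(knownFile, 1024).SequenceEqual(FirstBytes(candidate, 1024)))
                return false;

            // 3. Only now pay for a full MD5 of both files.
            return Md5Of(knownFile).SequenceEqual(Md5Of(candidate));
        }

        private static byte[] FirstBytes(string path, int count)
        {
            var buffer = new byte[count];
            using (var fs = File.OpenRead(path))
            {
                int total = 0, read;
                while (total < count && (read = fs.Read(buffer, total, count - total)) > 0)
                    total += read;
            }
            return buffer;      // zero-padded if the file is shorter than count
        }

        private static byte[] Md5Of(string path)
        {
            using (var md5 = MD5.Create())
            using (var fs = File.OpenRead(path))
                return md5.ComputeHash(fs);
        }
    }

In practice you would compute the known file's length, first 1 KB, and MD5 once and cache them, rather than re-reading the known file for every candidate.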

Michael Burr
+3  A: 

There is one small problem with using MD5 to compare files: there are known pairs of files which are different but have the same MD5.

This means you can use MD5 to tell if the files are different (if the MD5 is different, the files must be different), but you cannot use MD5 to tell if the files are equal (if the files are equal, the MD5 must be the same, but if the MD5 is equal, the files might or might not be equal).

You should either use a hash function which has not been broken yet (like SHA-1), or (as @SoapBox mentioned) use MD5 only as a fast way to find candidates for a deeper comparison.

CesarB
I never knew that. Thank you!
paulbeesley
Correct, but this applies to hashing in general. When the hash is n bits long, there are only 2^n possible hash values, but the number of different files is countably infinite. Thus, the number of pairs of different files that have the same hash value is also countably infinite.
Ingo
@Ingo: yeah, but for MD5, we know how to create a pair of files with the same hash value (not only that, but several such pairs are already known). For cryptographic hashes which have not been broken yet, we cannot create such a pair on purpose, and creating one by accident has an extremely small probability, small enough that we can treat it as if it were not possible at all (at least until that hash gets broken too).
CesarB
@CesarB: His use case doesn't seem to be one that cares about attacks, so it might not matter.
Scott Stafford
+2  A: 

Regardless of cryptographic requirements, the possibility of a hash collision exists, so no hashing function can be used to guarantee that two files are identical.

I wrote similar code a while back which I got to run pretty fast by indexing all the files first and discarding any with a different size. A fast hash comparison (on part of each file) was then performed on the remaining entries (comparing bytes for this step proved to be less useful; many file types have common headers, so files tend to have identical bytes near the start). Any files left after this stage were then checked using MD5, and finally a byte comparison of the whole file was done if the MD5 matched, just to ensure the contents were the same.
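As a rough outline of that ordering (every helper and every target* value here is a placeholder for whatever implementation you already have, not a real API):

    // Skeleton only: stages run cheapest to most expensive, each one shrinking the candidate set.
    var candidates = allFiles
        .Where(f => new FileInfo(f).Length == targetLength)        // 1. size
        .Where(f => PartialHash(f) == targetPartialHash)           // 2. fast hash of part of the file
        .Where(f => Md5Of(f).SequenceEqual(targetMd5))             // 3. full MD5
        .Where(f => BytesEqual(f, targetPath))                     // 4. byte-for-byte confirmation
        .ToList();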

Rich.
Sounds like a good, logical approach - thanks for chiming in.
paulbeesley