views:

53

answers:

5

Is there any fast algorithm that allows to compare two files (for verification purpose) without need to read the entire contents?

+1  A: 

You could use a MD5 hash on both files and compare them that way. However it does technically read the whole file. You won't be able to have 100% certainty without checking I don't think.

In C# one would do this in the following way (sorry, you didn't mention a specific language):

protected string GetMD5HashFromFile(string fileName)
{
    byte[] retVal = { };

    using (FileStream file = new FileStream(fileName, FileMode.Open))
    using (MD5 md5 = new MD5CryptoServiceProvider())
    {
        retVal = md5.ComputeHash(file);
    }

    if (retVal.Length > 0)
    {
        StringBuilder sb = new StringBuilder();

        for (int i = 0; i < retVal.Length; i++)
        {
            sb.Append(retVal[i].ToString("x2"));
        }

        return sb.ToString();
    }
    else
    {
        return string.Empty;
    }
}

bool CompareFiles(string fileName1, string fileName2)
{
    return (GetMD5HashFromFile(fileName1) == GetMD5HashFromFile(fileName2));
}
Kyle Rozendo
Thanks Kyle, I'll give it a try
Sphynx
It doesn't 'technically' read the entire file, it *reads the whole file*!
bmargulies
Which is, of course, the implication of saying "technically".
Kyle Rozendo
A: 

You could write a custom CRC procedure that reads bits of the file. e.g. 16 bytes for every 1k or something like that instead of CRCing the whole file. It's riskier, of course, since data could possibly change where you're not looking and not have an effect on your compared blocks. But CRC is a bit risky too since two very different data sets could return the same value.

Paul Sasik
A: 

I'm afraid you cannot avoid a full read of both files to be completely sure they're equals.

You can first check both file's size; if they're different, the files are different (but what about text files which would only differ on the line separator ?).

If size is the same, I don't see any correct way to go further but starting to read both files. Of course it can stop as soon as buffer differs, but it could only state that files are really equals when last character as been processed.

zim2001
A: 

There is no algorithm to be 100% sure the files are the same unless you read every byte. The proof is simple - assume such an algorithm exists, and we use it to compare two files. That implies some number of bytes are not read by the algorithm. I can cause the algorithm to fail by changing those bytes in one file but not the other.

Peter Recore
+1  A: 

It is mathematically impossible to determine that two files of the same size are equal without reading both of them entirely, but it is very possible to determine that they are not equal without reading both entirely. This can be done in various ways, such as using hash functions or short circuit comparison.

M.A. Hanin