Typical approaches recommend reading the binary via FileStream and comparing it byte-by-byte.

  • Would a checksum comparison such as CRC be faster?
  • Are there any .NET libraries that can generate a checksum for a file?
+24  A: 

A checksum comparison will most likely be slower than a byte-by-byte comparison.

In order to generate a checksum, you'll need to load each byte of the file and run it through the hash, and then do the same for the second file. That processing will almost certainly be slower than the comparison check itself.

As for generating a checksum: you can do this easily with the cryptography classes. Here's a short example of generating an MD5 checksum with C# (a minimal sketch; the helper name is illustrative):
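
    // Illustrative helper: computes an MD5 checksum for a file
    // and formats it as a hex string.
    static string GetMd5Checksum(string fileName)
    {
        using (var md5 = System.Security.Cryptography.MD5.Create())
        using (var stream = File.OpenRead(fileName))
        {
            byte[] hash = md5.ComputeHash(stream);
            // BitConverter produces "AB-CD-..."; strip the dashes for a compact string.
            return BitConverter.ToString(hash).Replace("-", "");
        }
    }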

However, a checksum may be faster and make more sense if you can pre-compute the checksum of the "test" or "base" case. If you have an existing file, and you're checking to see if a new file is the same as the existing one, pre-computing the checksum on your "existing" file would mean only needing to do the disk I/O once, on the new file. This would likely be faster than a byte-by-byte comparison.
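
For example, assuming the existing file's checksum was computed earlier and stored (precomputedChecksum and newFileName below are hypothetical), only the new file has to be read from disk now:

    bool same = GetMd5Checksum(newFileName) == precomputedChecksum;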

Reed Copsey
+2  A: 

Edit: This method would not work for comparing binary files!

In .NET 4.0, the File class has the following two new methods:

public static IEnumerable<string> ReadLines(string path)
public static IEnumerable<string> ReadLines(string path, Encoding encoding)

Which means you could use:

bool same = File.ReadLines(path1).SequenceEqual(File.ReadLines(path2));
280Z28
@dtb: It doesn't work for binary files. You were probably already typing the comment when I realized that and added the edit at the top of my post. :o
280Z28
@280Z28: I didn't say anything ;-)
dtb
A: 

If the files are not too big, you can use:

public static byte[] ComputeFileHash(string fileName)
{
    // Dispose the MD5 instance as well as the stream.
    using (var md5 = System.Security.Cryptography.MD5.Create())
    using (var stream = File.OpenRead(fileName))
        return md5.ComputeHash(stream);
}

Comparing hashes only really pays off if you can store the hashes and reuse them later.

(Edited the code to something much cleaner.)

Cecil Has a Name
+10  A: 

In addition to Reed Copsey's answer:

  • The worst case is where the two files are identical. In this case it's best to compare the files byte-by-byte.

  • If the two files are not identical, you can speed things up a bit by detecting sooner that they're not identical.

For example, if the two files are of different length then you know they cannot be identical, and you don't even have to compare their actual content.
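
A minimal sketch of both early exits (the helper name is illustrative; the chunked implementations further down apply the same idea with larger reads):

    static bool FilesAreEqual_EarlyExit(FileInfo first, FileInfo second)
    {
        // Different lengths can never be identical, so no content needs to be read.
        if (first.Length != second.Length)
            return false;

        using (FileStream fs1 = first.OpenRead())
        using (FileStream fs2 = second.OpenRead())
        {
            int b1, b2;
            do
            {
                b1 = fs1.ReadByte();
                b2 = fs2.ReadByte();
                if (b1 != b2)
                    return false; // stop at the first differing position
            } while (b1 != -1); // -1 means end of stream
        }
        return true;
    }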

dtb
To be complete: the other big gain is stopping as soon as the bytes at one position differ.
Henk Holterman
@Henk: I thought this was too obvious :-)
dtb
Good point on adding this. It was obvious to me, so I didn't include it, but it's good to mention.
Reed Copsey
+2  A: 

The only thing that might make a checksum comparison slightly faster than a byte-by-byte comparison is the fact that you are reading one file at a time, somewhat reducing the seek time for the disk head. That slight gain may, however, very well be eaten up by the added time of calculating the hash.

Also, a checksum comparison of course only has any chance of being faster if the files are identical. If they are not, a byte-by-byte comparison would end at the first difference, making it a lot faster.

You should also consider that a hash code comparison only tells you that it's very likely that the files are identical. To be 100% certain you need to do a byte-by-byte comparison.

If the hash code is 32 bits, for example, a random collision has a probability of about 2^-32 ≈ 2.3 × 10^-10, so you are about 99.99999998% certain that the files are identical if the hash codes match. That is close to 100%, but if you truly need 100% certainty, it isn't enough.

Guffa
+3  A: 

The slowest possible method is to compare two files byte by byte. The fastest I've been able to come up with is a similar comparison, but instead of one byte at a time, it uses a byte array sized to an Int64 and compares the resulting numbers.

Here's what I came up with:

    const int BYTES_TO_READ = sizeof(Int64);

    static bool FilesAreEqual(FileInfo first, FileInfo second)
    {
        // Different lengths can never be identical; this avoids reading anything.
        if (first.Length != second.Length)
            return false;

        int iterations = (int)Math.Ceiling((double)first.Length / BYTES_TO_READ);

        using (FileStream fs1 = first.OpenRead())
        using (FileStream fs2 = second.OpenRead())
        {
            byte[] one = new byte[BYTES_TO_READ];
            byte[] two = new byte[BYTES_TO_READ];

            for (int i = 0; i < iterations; i++)
            {
                // Note: this relies on FileStream filling the 8-byte buffer in one call,
                // which holds for local files but not for streams in general.
                fs1.Read(one, 0, BYTES_TO_READ);
                fs2.Read(two, 0, BYTES_TO_READ);

                // Comparing two Int64s is cheaper than eight separate byte comparisons.
                if (BitConverter.ToInt64(one, 0) != BitConverter.ToInt64(two, 0))
                    return false;
            }
        }

        return true;
    }

In my testing, I was able to see this outperform a straightforward ReadByte() scenario by almost 3:1. Averaged over 1000 runs, I got this method at 1063ms, and the method below (straightforward byte by byte comparison) at 3031ms. Hashing always came back sub-second at around an average of 865ms. This testing was with an ~100MB video file.

Here are the ReadByte and hashing methods I used, for comparison purposes:

    static bool FilesAreEqual_OneByte(FileInfo first, FileInfo second)
    {
        if (first.Length != second.Length)
            return false;

        using (FileStream fs1 = first.OpenRead())
        using (FileStream fs2 = second.OpenRead())
        {
            for (int i = 0; i < first.Length; i++)
            {
                if (fs1.ReadByte() != fs2.ReadByte())
                    return false;
            }
        }

        return true;
    }

    static bool FilesAreEqual_Hash(FileInfo first, FileInfo second)
    {
        byte[] firstHash, secondHash;

        // Hash both files; the using blocks ensure the streams are closed.
        using (var stream = first.OpenRead())
            firstHash = MD5.Create().ComputeHash(stream);
        using (var stream = second.OpenRead())
            secondHash = MD5.Create().ComputeHash(stream);

        for (int i = 0; i < firstHash.Length; i++)
        {
            if (firstHash[i] != secondHash[i])
                return false;
        }
        return true;
    }
chsh
+1  A: 

It gets even faster if you don't read in small 8-byte chunks but instead put a loop around it and read a larger chunk. This reduced the average comparison time to about a quarter in my tests.

    public static bool FilesContentsAreEqual(FileInfo fileInfo1, FileInfo fileInfo2)
    {
        bool result;

        if (fileInfo1.Length != fileInfo2.Length)
        {
            result = false;
        }
        else
        {
            using (var file1 = fileInfo1.OpenRead())
            {
                using (var file2 = fileInfo2.OpenRead())
                {
                    result = StreamsContentsAreEqual(file1, file2);
                }
            }
        }

        return result;
    }

    private static bool StreamsContentsAreEqual(Stream stream1, Stream stream2)
    {
        const int bufferSize = 2048 * 2;
        var buffer1 = new byte[bufferSize];
        var buffer2 = new byte[bufferSize];

        while (true)
        {
            int count1 = stream1.Read(buffer1, 0, bufferSize);
            int count2 = stream2.Read(buffer2, 0, bufferSize);

            // For FileStreams over files of equal length the read counts match,
            // so a mismatch here means the streams differ.
            if (count1 != count2)
            {
                return false;
            }

            // Both streams are exhausted without finding a difference.
            if (count1 == 0)
            {
                return true;
            }

            // Compare the chunk eight bytes at a time, as in the answer above.
            int iterations = (int)Math.Ceiling((double)count1 / sizeof(Int64));
            for (int i = 0; i < iterations; i++)
            {
                if (BitConverter.ToInt64(buffer1, i * sizeof(Int64)) != BitConverter.ToInt64(buffer2, i * sizeof(Int64)))
                {
                    return false;
                }
            }
        }
    }
Lars
+1  A: 

Another improvement for large files with identical length might be not to read the files sequentially, but rather to compare more or less random blocks.

You can use multiple threads, starting at different positions in the files and comparing either forward or backward.

This way you can detect changes in the middle or at the end of the files faster than a sequential approach would get there; a sketch of the idea follows below.
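
A minimal sketch of the parallel variant (the helper name and task count are illustrative; it assumes local FileStreams, where a shared bool flag is enough to let all slices bail out early):

    // Each task opens its own stream pair and compares one slice of the files.
    // Requires System.IO and System.Threading.Tasks (.NET 4.0).
    static bool FilesAreEqual_Parallel(string path1, string path2, int taskCount = 4)
    {
        long length = new FileInfo(path1).Length;
        if (length != new FileInfo(path2).Length)
            return false;

        long sliceSize = length / taskCount + 1;
        bool different = false;

        Parallel.For(0, taskCount, slice =>
        {
            using (var fs1 = File.OpenRead(path1))
            using (var fs2 = File.OpenRead(path2))
            {
                long start = slice * sliceSize;
                fs1.Seek(start, SeekOrigin.Begin);
                fs2.Seek(start, SeekOrigin.Begin);

                var buffer1 = new byte[4096];
                var buffer2 = new byte[4096];
                long remaining = Math.Min(sliceSize, length - start);

                while (remaining > 0 && !different)
                {
                    int toRead = (int)Math.Min(buffer1.Length, remaining);
                    int read1 = fs1.Read(buffer1, 0, toRead);
                    fs2.Read(buffer2, 0, read1);

                    for (int i = 0; i < read1; i++)
                    {
                        if (buffer1[i] != buffer2[i])
                        {
                            different = true; // lets the other slices stop early
                            return;
                        }
                    }
                    remaining -= read1;
                }
            }
        });

        return !different;
    }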

thomask
A: 

One option is to read each data block into a long array and then XOR the corresponding values: the result is zero exactly when the blocks are identical.

If the two files are very large, you can try memory-mapping them to reduce memory use and improve file I/O speed.
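
A minimal sketch combining both suggestions (the helper name is illustrative; System.IO.MemoryMappedFiles has been available since .NET 4.0, and non-empty files are assumed):

    static bool FilesAreEqual_MemoryMapped(string path1, string path2)
    {
        long length = new FileInfo(path1).Length;
        if (length != new FileInfo(path2).Length)
            return false;

        using (var mmf1 = MemoryMappedFile.CreateFromFile(path1, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        using (var mmf2 = MemoryMappedFile.CreateFromFile(path2, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        using (var view1 = mmf1.CreateViewAccessor(0, length, MemoryMappedFileAccess.Read))
        using (var view2 = mmf2.CreateViewAccessor(0, length, MemoryMappedFileAccess.Read))
        {
            long i = 0;

            // XOR eight bytes at a time; the result is zero only when the values match.
            for (; i + sizeof(long) <= length; i += sizeof(long))
            {
                if ((view1.ReadInt64(i) ^ view2.ReadInt64(i)) != 0)
                    return false;
            }

            // Tail bytes that don't fill a whole Int64.
            for (; i < length; i++)
            {
                if (view1.ReadByte(i) != view2.ReadByte(i))
                    return false;
            }
        }
        return true;
    }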

A: 

Comparing two images in C#

reza