tags:

views:

2380

answers:

5

I'm in a situation where I want to compare two binary files. One of them is already stored on the server with a pre-calculated Crc32 in the database from when I stored it originally.

I know that if the Crc is different then the files are definitely different. However, if the Crc is the same I don't know that the files are. So what I'm looking for is a nice efficient way of comparing the two streams on from the posted file and one from the file system.

I'm not an expert on streams but I'm well aware that I could easily shoot myself in the foot here as far as memory usage is concerned.

Any help is greatly appreciated.

+3  A: 

if you change that crc to a sha1 signature the chances of it being different but with the same signature are astronomicly small

the_ajp
You should never rely on that in most serious apps. It's like just checking the hash in a hashtable lookup without comparing the actual keys!
Mehrdad Afshari
unfortunately you can guarantee that the one time it messes up will be absolutely critical, probably that one big pitch.
Simon Farrow
@Simon - hehe very true.@Mehrdad - No probably not but it would greatly reduce the times you'd have to check to be super uber sure.
the_ajp
Take the CRC and say file size and the changes are ever smaller.
kenny
+2  A: 

You can check the length and dates of the two files even before checking the CRC to possibly avoid the CRC check.

But if you have to compare the entire file contents, one neat trick I've seen is reading the bytes in strides equal to the bitness of the CPU. For example, on a 32 bit PC, read 4 bytes at a time and compare them as int32's. On a 64 bit PC you can read 8 bytes at a time. This is roughly 4 or 8 times as fast as doing it byte by byte. You also would probably wanna use an unsafe code block so that you could use pointers instead of doing a bunch of bit shifting and OR'ing to get the bytes into the native int sizes.

You can use IntPtr.Size to determine the ideal size for the current processor architecture.

Josh Einstein
Could you provide a code sample on how you would do that?
Svish
+7  A: 
static bool FileEquals(string fileName1, string fileName2)
{
    // Check the file size and CRC equality here.. if they are equal...    
    using (var file1 = new FileStream(fileName1, FileMode.Open))
        using (var file2 = new FileStream(fileName2, FileMode.Open))
            return StreamEquals(file1, file2);
}

static bool StreamEquals(Stream stream1, Stream stream2)
{
    const int bufferSize = 2048;
    byte[] buffer1 = new byte[bufferSize]; //buffer size
    byte[] buffer2 = new byte[bufferSize];
    while (true) {
        int count1 = stream1.Read(buffer1, 0, bufferSize);
        int count2 = stream2.Read(buffer2, 0, bufferSize);

        if (count1 != count2)
            return false;

        if (count1 == 0)
            return true;

        // You might replace the following with an efficient "memcmp"
        if (!buffer1.Take(count1).SequenceEqual(buffer2.Take(count2)))
            return false;
    }
}
Mehrdad Afshari
A: 

Thanks Mehrdad!

Khaos
A: 

I sped up the "memcmp" by using a Int64 compare in a loop over the read stream chunks. This reduced time to about 1/4.

    private static bool StreamsContentsAreEqual(Stream stream1, Stream stream2)
    {
        const int bufferSize = 2048 * 2;
        var buffer1 = new byte[bufferSize];
        var buffer2 = new byte[bufferSize];

        while (true)
        {
            int count1 = stream1.Read(buffer1, 0, bufferSize);
            int count2 = stream2.Read(buffer2, 0, bufferSize);

            if (count1 != count2)
            {
                return false;
            }

            if (count1 == 0)
            {
                return true;
            }

            int iterations = (int)Math.Ceiling((double)count1 / sizeof(Int64));
            for (int i = 0; i < iterations; i++)
            {
                if (BitConverter.ToInt64(buffer1, i * sizeof(Int64)) != BitConverter.ToInt64(buffer2, i * sizeof(Int64)))
                {
                    return false;
                }
            }
        }
    }
Lars