views: 2045
answers: 8
Hi,
I have to sync large files across some machines. The files can be up to 6 GB in size. The sync will be done manually every few weeks. I can't take the filename into consideration because it can change at any time.

My plan is to create checksums on the destination PC and on the source PC and then copy all files whose checksum is not already present on the destination - essentially a set difference, as sketched at the end of this question. My first attempt was something like this:

using System;
using System.IO;
using System.Security.Cryptography;

private static string GetChecksum(string file)
{
    using (FileStream stream = File.OpenRead(file))
    {
        // Hash the whole file and return the digest as an uppercase hex string
        SHA256Managed sha = new SHA256Managed();
        byte[] checksum = sha.ComputeHash(stream);
        return BitConverter.ToString(checksum).Replace("-", String.Empty);
    }
}

The problem was the runtime:
- SHA256 with a 1.6 GB file -> 20 minutes
- MD5 with a 1.6 GB file -> 6.15 minutes

Is there a better - faster - way to get the checksum (maybe with a better hash function)?
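
As an aside, the comparison step of the plan above is just a set difference over the computed checksums; a rough sketch, with illustrative paths and assuming System.Collections.Generic, System.IO and System.Linq are available:

// Copy every source file whose checksum is not already present on the destination.
var destChecksums = new HashSet<string>(
    Directory.GetFiles(@"\\dest\share").Select(GetChecksum));

foreach (string sourceFile in Directory.GetFiles(@"D:\source"))
{
    if (!destChecksums.Contains(GetChecksum(sourceFile)))
        File.Copy(sourceFile, Path.Combine(@"\\dest\share", Path.GetFileName(sourceFile)));
}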

A: 

You could try using an external tool to copy the files; for example, robocopy can copy only changed files.

There are also other options; for example, Microsoft Groove will look at the binary and copy only the changes, which may be worth it for such large files.
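
A minimal robocopy invocation of that kind (paths are illustrative; by default robocopy skips files whose size and timestamp are unchanged):

robocopy D:\source \\server\dest /E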

Shiraz Bhaiji
+14  A: 

Don't checksum the entire file; create a checksum every 100 MB or so, so each file has a collection of checksums.

Then when comparing checksums, you can stop after the first mismatching checksum, getting out early and saving you from processing the entire file.

It'll still take the full time for identical files.
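
A minimal sketch of that chunked approach, assuming MD5 per chunk (the chunk size, method name and required usings - System.Collections.Generic, System.IO, System.Security.Cryptography - are illustrative):

private static List<byte[]> GetChunkChecksums(string file)
{
    const int chunkSize = 100 * 1024 * 1024; // ~100 MB per chunk, as suggested above
    var checksums = new List<byte[]>();
    using (var md5 = MD5.Create())
    using (var stream = File.OpenRead(file))
    {
        byte[] buffer = new byte[chunkSize];
        int filled;
        do
        {
            // Fill the chunk buffer completely; Read may return fewer bytes than asked for
            filled = 0;
            int read;
            while (filled < chunkSize &&
                   (read = stream.Read(buffer, filled, chunkSize - filled)) > 0)
            {
                filled += read;
            }
            if (filled > 0)
                checksums.Add(md5.ComputeHash(buffer, 0, filled));
        } while (filled == chunkSize);
    }
    return checksums;
}

Two files then match only if their checksum lists have the same length and agree element by element; the comparison can return false at the first differing chunk.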

Binary Worrier
I like the idea, but it will not work in my scenario, because over time I will end up with a lot of unchanged files.
crono
+2  A: 

Isn't CRC faster and still sufficient for checking if two files are different?
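
The BCL has no built-in CRC-32; a rough sketch using the Crc32 type from the System.IO.Hashing NuGet package (which postdates this question) might look like:

using System;
using System.IO;
using System.IO.Hashing; // NuGet package: System.IO.Hashing

private static string GetCrc32(string file)
{
    var crc = new Crc32();
    using (FileStream stream = File.OpenRead(file))
    {
        byte[] buffer = new byte[1024 * 1024]; // 1 MB read buffer
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            crc.Append(buffer.AsSpan(0, read));
    }
    return BitConverter.ToString(crc.GetCurrentHash()).Replace("-", string.Empty);
}

CRC-32 is fine for catching accidental corruption, but its 32-bit output makes collisions between different files far more likely than with MD5 or SHA-256.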

+5  A: 

Invoke the Windows port of md5sum.exe. It's about twice as fast as the .NET implementation (at least on my machine, using a 1.2 GB file):

// Requires: using System.Diagnostics;
public static string Md5SumByProcess(string file) {
    var p = new Process();
    p.StartInfo.FileName = "md5sum.exe";
    p.StartInfo.Arguments = "\"" + file + "\""; // quote the path in case it contains spaces
    p.StartInfo.UseShellExecute = false;
    p.StartInfo.RedirectStandardOutput = true;
    p.Start();
    // Read the output before waiting so the process can't block on a full output pipe
    string output = p.StandardOutput.ReadToEnd();
    p.WaitForExit();
    // Output is "<hash> <filename>"; GNU md5sum prefixes the line with '\' when the
    // path contains backslashes, which Substring(1) strips before upper-casing the hash
    return output.Split(' ')[0].Substring(1).ToUpper();
}
Christian Birkl
WOW - using md5sums.exe from pc-tools.net/win32/md5sums makes it really fast: 1,681,457,152 bytes in 8,672 ms = 184.91 MB/sec, so 1.6 GB in ~9 seconds. This will be fast enough for my purpose.
crono
+12  A: 

The problem here is that SHA256Managed reads 4096 bytes at a time (inherit from FileStream and override Read(byte[], int, int) to see how much it reads from the FileStream), which is too small a buffer for disk I/O.
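
For the curious, a quick way to watch those read sizes is a throwaway FileStream wrapper (a sketch; the class name is made up):

using System;
using System.IO;

// Logs the size of every read request made against the file,
// so you can see how many bytes ComputeHash asks for per call.
class LoggingFileStream : FileStream
{
    public LoggingFileStream(string path)
        : base(path, FileMode.Open, FileAccess.Read) { }

    public override int Read(byte[] array, int offset, int count)
    {
        Console.WriteLine("Read requested: {0} bytes", count);
        return base.Read(array, offset, count);
    }
}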

To speed things up (2 minutes for hashing a 2 GB file on my machine with SHA256, 1 minute for MD5), wrap the FileStream in a BufferedStream and set a reasonably sized buffer (I tried with a ~1 MB buffer):

// Not sure if BufferedStream should be wrapped in a using block
using (var stream = new BufferedStream(File.OpenRead(filePath), 1200000))
{
    // The rest remains the same
}
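
Put together with the original method, a complete version of this approach might look like the following sketch (the 1 MB buffer and MD5 reflect the figures mentioned in this thread):

using System;
using System.IO;
using System.Security.Cryptography;

private static string GetChecksumBuffered(string file)
{
    // BufferedStream does the large sequential disk reads; ComputeHash still
    // consumes small blocks, but they now come out of the in-memory buffer.
    using (var stream = new BufferedStream(File.OpenRead(file), 1024 * 1024))
    using (var md5 = MD5.Create())
    {
        byte[] checksum = md5.ComputeHash(stream);
        return BitConverter.ToString(checksum).Replace("-", string.Empty);
    }
}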
Anton Gogolev
Thanks for the info - I will check it out
crono
OK - this made the difference - hashing the 1.6 GB file with MD5 took 5.2 seconds on my box (quad core @ 2.6 GHz, 8 GB RAM) - even faster than the native implementation...
crono
I don't get it. I just tried this suggestion, but the difference is minimal to nothing: a 1024 MB file takes 12-14 secs without buffering and 12-14 secs with buffering. I understand that many small 4 KB reads produce more I/O, but I wonder whether the framework or the native APIs below it don't already handle this...
Christian
+2  A: 

You're doing something wrong (probably a read buffer that is too small). On a machine of indecent age (Athlon 2x1800MP from 2002) with disk DMA probably out of whack (6.6 MB/s is damn slow for sequential reads):

Create a 1G file with "random" data:

# dd if=/dev/sdb of=temp.dat bs=1M count=1024    
1073741824 bytes (1.1 GB) copied, 161.698 s, 6.6 MB/s

# time sha1sum -b temp.dat
abb88a0081f5db999d0701de2117d2cb21d192a2 *temp.dat

1m5.299s

# time md5sum -b temp.dat
9995e1c1a704f9c1eb6ca11e7ecb7276 *temp.dat

1m58.832s

This is also weird: md5 is consistently slower than sha1 for me (reran several times).

Pasi Savolainen
Yes - I will try to increase the buffer, like Anton Gogolev suggested. I ran it through a "native" MD5.exe, which took 9 seconds with a 1.6 GB file.
crono
+1  A: 

OK - thanks to all of you - let me wrap this up:

  1. Using a "native" exe to do the hashing cut the time from 6 minutes to 10 seconds, which is huge.
  2. Increasing the buffer was even faster - the 1.6 GB file took 5.2 seconds using MD5 in .NET, so I will go with this solution - thanks again.
crono
A: 

Hello all. This is not an answer, but I am interested in this topic. My question is: is an MD5 checksum more accurate than SHA or the other forms of calculation, since they take a large number of bytes at a time?

If you have a question, then post it as a question.
Matt Ellen