I need to calculate checksums of quite large files (gigabytes). This can be accomplished using the following method:

    private byte[] calcHash(string file)
    {
        // Dispose both the hash algorithm and the stream, even if ComputeHash throws.
        using (System.Security.Cryptography.HashAlgorithm ha = System.Security.Cryptography.MD5.Create())
        using (FileStream fs = new FileStream(file, FileMode.Open, FileAccess.Read))
        {
            return ha.ComputeHash(fs);
        }
    }

However, the files are normally written just beforehand in a buffered manner (say, 32 MB at a time). I am fairly sure I once saw an overload of a hash function that let me calculate an MD5 (or other) hash while writing, i.e. calculating the hash of one buffer, then feeding that resulting hash into the next iteration.

Something like this: (pseudocode-ish)

    byte[] hash = new byte[] { 0,0,0,0,0,0,0,0 };
    while(!eof)
    {
        buffer = readFromSourceFile();
        writefile(buffer);
        hash = calchash(buffer, hash);
    }

hash is now similar to what would be accomplished by running the calcHash function on the entire file.

Now, I can't find any overloads like that in the .NET 3.5 Framework. Am I dreaming? Has it never existed, or am I just lousy at searching? I want to do the writing and the checksum calculation in one pass because, with files this large, reading everything a second time just to hash it is expensive.

+3  A: 

Hash algorithms are expected to handle this situation and are typically implemented with 3 functions:

hash_init() - Called to allocate resources and begin the hash.
hash_update() - Called with new data as it arrives.
hash_final() - Complete the calculation and free resources.

Look at http://www.openssl.org/docs/crypto/md5.html or http://www.openssl.org/docs/crypto/sha.html for good, standard examples in C; I'm sure there are similar libraries for your platform.
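In .NET the same three steps appear to map onto `MD5.Create()` (init), `TransformBlock` (update) and `TransformFinalBlock` (final). The following is only a minimal sketch of that mapping; the class name `IncrementalHashDemo`, the method `HashInChunks` and the `chunkSize` parameter are made up for illustration. Fed the same bytes, it should produce the same digest as a single `ComputeHash` call.

```csharp
using System;
using System.Security.Cryptography;

static class IncrementalHashDemo
{
    // Hypothetical helper: hashes 'data' in pieces of at most 'chunkSize' bytes.
    public static byte[] HashInChunks(byte[] data, int chunkSize)
    {
        using (MD5 md5 = MD5.Create())                          // init
        {
            for (int pos = 0; pos < data.Length; pos += chunkSize)
            {
                int count = Math.Min(chunkSize, data.Length - pos);
                // update: a null output buffer means "just hash, don't copy"
                md5.TransformBlock(data, pos, count, null, 0);
            }
            // final: an empty block only finishes the calculation
            md5.TransformFinalBlock(new byte[0], 0, 0);
            return md5.Hash;
        }
    }
}
```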

Adam Liss
Good answer, but the "where is it in .net?" part of the question remains open.
Pascal Cuoq
@Pascal: See the 2 good answers below, both of which had been posted before your comment.
Adam Liss
+5  A: 

You use the TransformBlock and TransformFinalBlock methods to process the data in chunks.

    // Init
    MD5 md5 = MD5.Create();
    int offset = 0;

    // For each block:
    offset += md5.TransformBlock(block, 0, block.Length, block, 0);

    // For last block:
    md5.TransformFinalBlock(block, 0, block.Length);

    // Get the hash code
    byte[] hash = md5.Hash;

Note: It works (at least with the MD5 provider) to send all blocks to TransformBlock and then send an empty block to TransformFinalBlock to finalise the process.
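Applied to the original question, writing and hashing could be combined in one loop roughly as follows. This is a sketch under assumptions, not a tested production routine: `HashingCopy`, `CopyAndHash` and `bufferSize` are names invented here, and it relies on the behaviour described in the note above (all data goes through `TransformBlock`, then an empty block goes to `TransformFinalBlock`).

```csharp
using System.IO;
using System.Security.Cryptography;

static class HashingCopy
{
    // Hypothetical helper: copies 'source' to 'destination' and returns the
    // MD5 hash of everything that was copied.
    public static byte[] CopyAndHash(Stream source, Stream destination, int bufferSize)
    {
        using (MD5 md5 = MD5.Create())
        {
            byte[] buffer = new byte[bufferSize];
            int read;
            while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
            {
                destination.Write(buffer, 0, read);            // write the chunk
                md5.TransformBlock(buffer, 0, read, null, 0);  // hash the same chunk
            }
            md5.TransformFinalBlock(buffer, 0, 0);             // empty final block
            return md5.Hash;
        }
    }
}
```

With two `FileStream` instances as source and destination and a buffer of, say, 32 MB, each chunk is read only once and the hash is available as soon as the copy finishes.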

Guffa
omg, just posted same suggestion, using same formatting =)
Rubens Farias
Ok, but +1 for also providing a reference!
Adam Liss
Ay caramba! There it is! That was the function I was searching for. Good to know I wasn't making it all up. Thanks to Guffa and Rubens for providing the correct answer so promptly. +1 to you both, I will accept this answer because of the included code example.
sindre j
+4  A: 

It seems you can use TransformBlock / TransformFinalBlock, as shown in this sample: Displaying progress updates when hashing large files

Rubens Farias
A: 

Is it possible to "initialize" the calls to TransformBlock and TransformFinalBlock with the hash value from the previous TransformBlock? For each Transform call I would create a totally new sha1 instance (I'm using sha1 vs. md5) and initialize it with the previous hash value. Does anyone know if the sha1 algorithm (or any other hash algorithm) is written such that it would be possible to do this?

The scenario where I would want to do this is with a server process that receives files from a sender over a network. The sender sends the files to the server process in blocks. Ideally the server process could calculate the hash on each block as it is sent instead of calculating the hash on the entire file after all blocks have been sent.

My thinking is that it would be less of a housekeeping headache for me to store the intermediate hash values server side, and use the intermediate hash values to initialize a new sha1 structure vs. trying to persist sha1 data structures between block uploads.

The server code would look something like this (very pseudocode-ish):

    sub receiveBlock(block() as byte)

      sha1 = New SHA1CryptoServiceProvider

      if <not the first block of the file> then

        '
        ' This is the part I don't know how to do
        '
        sha1.initializeWithPriorHashValue(lastHash)

      end if

      sha1.TransformBlock(block, 0, block.length, block, 0)
      rememberHashValue(sha1.hash)
      writeToServerDisk(block)

    end sub

    sub receiveFinalBlock(block() as byte)

      sha1 = New SHA1CryptoServiceProvider

      '
      ' This is the part I don't know how to do
      '
      sha1.initializeWithPriorHashValue(lastHash)

      sha1.TransformFinalBlock(block, 0, block.length)
      rememberHashValue(sha1.hash)
      writeToServerDisk(block)

    end sub
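For comparison, here is the alternative I am trying to avoid, sketched in C#: keep one live hash object per upload in memory and feed it each block as it arrives. As far as I can tell, the framework's hash classes offer no way to seed a new instance from an earlier hash value, and `Hash` is only meaningful after `TransformFinalBlock`. The names `UploadHasher` and `uploadId` are invented for this sketch, it is not thread-safe, and the in-memory state is lost if the server process restarts.

```csharp
using System.Collections.Generic;
using System.Security.Cryptography;

class UploadHasher
{
    // One live SHA-1 instance per upload in progress (hypothetical bookkeeping).
    private readonly Dictionary<string, SHA1> active = new Dictionary<string, SHA1>();

    public void ReceiveBlock(string uploadId, byte[] block)
    {
        SHA1 sha1;
        if (!active.TryGetValue(uploadId, out sha1))
        {
            sha1 = SHA1.Create();          // first block of this upload
            active[uploadId] = sha1;
        }
        sha1.TransformBlock(block, 0, block.Length, null, 0);
        // Writing the block to disk would go here.
    }

    public byte[] ReceiveFinalBlock(string uploadId, byte[] block)
    {
        SHA1 sha1;
        if (!active.TryGetValue(uploadId, out sha1))
        {
            sha1 = SHA1.Create();          // upload consisted of a single block
        }
        active.Remove(uploadId);
        using (sha1)
        {
            sha1.TransformFinalBlock(block, 0, block.Length);
            return sha1.Hash;              // hash of all blocks of this upload
        }
    }
}
```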
john