views:

313

answers:

5

When calculating a single MD5 checksum on a large file, what technique is generally used to combine the various MD5 values into a single value? Do you just add them together? I'm not really interested in any particular language, library or API which will do this; rather I'm just interested in the technique behind it. Can someone explain how it is done?

Given the following algorithm in pseudo-code:

MD5Digest X
for each file segment F
   MD5Digest Y = CalculateMD5(F)
   Combine(X,Y)

But what exactly would Combine do? Does it add the two MD5 digests together, or what?

+2  A: 

The openSSL library allows you to add blocks of data to a ongoing hash (sha1/md5) then when you have finished adding all the data you call the Final method and it will output the final hash.

You don't calculate md5 on each individual block then add it, rather you add the data to the ongoing hash method from the openssl library. This will then give you an md5 hash of all the individual data blocks with no limit on the input data size.

http://www.openssl.org/docs/crypto/md5.html#

Harley Green
+2  A: 

This question doesn't make much sense as the MD5 algorithm takes any length input. A decent library should have functions so that you don't have to add the entire message at a single time as the message is broken down into blocks an hashed sequentially, with the block that is being processed depending only on the resultant hashes from the previous loop.

The pseudo code in the wikipedia article should give a overview of how the algorithm works.

Yacoby
+6  A: 

In order to calculate MD5 values for files which are too large to fit in memory

With that in mind, you don't want to "combine" two MD5 hashes. With any MD5 implementation, you have a object that keeps the current checksum state. So you can extract the MD5 checksum at any time, which is very handy when hashing two files that share the same beginning. For big files, you just keep feeding in data - there's no difference if you hash the file at once or in blocks, as the state is remembered. In both cases you will get the same hash.

AndiDog
A: 

Most digest calculation implementations allow you to feed them the data in smaller blocks. You can't combine multiple MD5 digests in a way that the result will be equal to the MD5 of the entire input. MD5 does some padding and uses the number of proccessed bytes in the final stage which makes the original engine state unrecoverable from the final digest value.

x4u
+2  A: 

MD5 is an iterative algorithm. You don't need to calculate a ton of small MD5's and then combine them somehow. You just read small chunks of the the file and add them to the digest as your're going, so you never have to have the entire file in memory at once. Here's a java implementation.

FileInputStream f = new FileInputStream(new File("bigFile.txt"));
MessageDigest digest = MessageDigest.getInstance("md5");
byte[] buffer = new byte[8192];
int len = 0;
while (-1 != (len = f.read(buffer))) {
   digest.update(buffer,0,len);
}
byte[] md5hash = digest.digest();

Et voila. You have the MD5 of an entire file without ever having the whole file in memory at once.

Its worth noting that if for some reason you do want MD5 hashes of subsections of the file as you go along (this is sometimes useful for doing interim checks on a large file being transferred over a low bandwidth connection) then you can get them by cloning the digest object at any time, like so

byte[] interimHash = ((MessageDigest)digest.clone()).digest();

This does not affect the actual digest object so you can continue to work with the overall MD5 hash.

Its also worth noting that MD5 is an outdated hash for cryptographic purposes (such as verifying file authenticity from an untrusted source) and should be replaced with something better in most circumstances, such as SHA-1. For non-cryptographic purposes, such as verifying file integrity between two trusted sources, MD5 is still adequate.

Jherico