Is there any efficiency analysis of how MD5 performance depends on file size? Does it actually depend on the file size, or on the content of the file? For example, if I have a 500 MB file of all blank spaces and a 500 MB file containing a movie, would MD5 take the same time to generate the hash for each?

+4  A: 

Any hash sum is, by definition, computed from every byte of what you're hashing. You have to read the file through a stream at the very least - more bytes take longer to traverse. However, I'd say (generally speaking) the bottleneck will be reading the file, no matter what you're trying to do with it - not hashing it once you've read it.

Edit: I kinda misread the question. It will take the same amount of time to hash two files of equal size. 500 MB of spaces is 500 MB of bytes that each represent "space". That's still 8 bits of data per byte, same as any other file.
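
To make the "read it through a stream" point concrete, here is a minimal Python sketch that hashes a file in fixed-size chunks. The function name `md5_of_file` and the 1 MiB chunk size are my own choices for illustration, not anything from the question. The loop does one read and one update per chunk, whatever the bytes happen to be.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"" (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            # Every byte of every chunk is fed to the hash, whatever its value.
            digest.update(chunk)
    return digest.hexdigest()
```

Two files of the same size go through the same number of loop iterations, so any difference in wall-clock time would most likely come from I/O (caching, disk speed) rather than from the hash itself.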

Rex M
+3  A: 

Hashes in general, MD5 included, do not have performance that depends on the content of the input, only on its length.

Will
How would it be possible for a computer to traverse a short byte array in exactly the same amount of time as a long one?
Rex M
Er, I see what you're saying now. One byte is a space, the other byte is not - same size, different content. Never mind :)
Rex M
+1  A: 
gahooa
Those images might vanish if the page is edited later - you might want to mirror them somewhere...
bdonlan
+1  A: 

Here's a quick empirical test.

# dd if=/dev/urandom of=randomfile bs=1024 count=512000
# dd if=/dev/zero of=zerofile bs=1024 count=512000

# time md5 randomfile 
MD5 (randomfile) = bb318fa1561b17e30d03b12e803262e4

real    0m2.753s
user    0m1.567s
sys     0m1.157s

# time md5 zerofile
MD5 (zerofile) = d8b61b2c0025919d5321461045c8226f

real    0m2.761s
user    0m1.567s
sys     0m1.168s

The two timings are essentially identical. This is expected, as the previous answers note: the bit manipulations used in the MD5 algorithm take the same time regardless of the data.
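
If you'd rather take disk I/O out of the picture, the same comparison can be sketched in memory with Python's `hashlib`. This is only a rough illustration: the helper name `time_md5`, the 64 MiB buffer size, and the repeat count are arbitrary assumptions of mine, and the numbers will vary by machine.

```python
import hashlib
import os
import time


def time_md5(data, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` MD5 runs of `data`."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        hashlib.md5(data).digest()
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    size = 64 * 1024 * 1024  # 64 MiB, chosen only to keep the run short
    buffers = {
        "zeros": bytes(size),         # all 0x00 bytes
        "spaces": b" " * size,        # all 0x20 bytes
        "random": os.urandom(size),   # high-entropy content
    }
    for label, buf in buffers.items():
        print(f"{label:>7}: {time_md5(buf):.3f} s")
```

Taking the best of several runs is just a cheap way to reduce noise from other processes; for buffers of equal size the three figures should come out close to each other.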

Robert Duncan
A: 

MD5, like most other hash algorithms, operates on blocks. For each 512-bit block of the input it performs the same operation and uses the output as part of the input for the next block.

Each block goes through the same sequence of basic operations (XOR, AND, NOT, etc.). On all processors that I know of, these operations take the same time no matter what the arguments are. So the time MD5 takes to process an input should be linear in the number of 512-bit blocks in the input.
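
As a small illustration of "linear in the number of blocks", the block count can be computed from the input length alone. The sketch below assumes the standard MD5 padding (one 0x80 byte, zero fill, then an 8-byte length field); the function name `md5_block_count` is made up for this example.

```python
def md5_block_count(num_bytes):
    """Number of 512-bit (64-byte) blocks MD5 processes for a num_bytes-byte message."""
    # Padding appends one 0x80 byte, then zero bytes, then an 8-byte length field,
    # so the padded length is the smallest multiple of 64 that is >= num_bytes + 9.
    return (num_bytes + 8) // 64 + 1


if __name__ == "__main__":
    for size in (0, 55, 56, 500 * 1024 * 1024):
        print(size, md5_block_count(size))
```

The content of the file never appears in the formula: a 500 MiB file of spaces and a 500 MiB movie both come to 8,192,001 blocks.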

Rasmus Faber