tags:
views: 65
answers: 1

I want to calculate the checksum of a large TIFF file that might not fit in memory. Will I get a reliable value if I instead calculate the checksum of every page and then calculate the checksum of the array of page checksums? Or is there a mathematical problem I'm not seeing, so that the only correct way is to work with the whole file?

Thanks!

A: 

I don't know if I understood the question correctly, but with most checksum algorithms you only need to load a small part of the message into memory at a time. Because of that, operating on streams instead of in-memory buffers is possible and has been done before.
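To illustrate the streaming point: Python's `hashlib` hashes update incrementally, so you can feed the file in fixed-size chunks and never hold more than one chunk in memory. A minimal sketch (the chunk size is an arbitrary choice, not something from the question):

```python
import hashlib

def file_checksum(path, chunk_size=1 << 20):
    """Stream a file through MD5 in fixed-size chunks so memory
    use stays constant regardless of file size."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()
```

Because the hash state is updated piece by piece, this produces exactly the same digest as hashing the whole file contents in one call.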

Edit:

I only know that you have to be careful with Adler-32 when checksumming short messages: you would not be covering the whole hash space, and false positives are more likely (and yes, the array of checksums would probably be a short message).

With crypto hashes I honestly don't know. My intuition is that md5(msg1 + msg2 + ...) is as reliable as md5(md5(msg1) + md5(msg2) + ...), but we'll have to wait for someone smarter than me to give a definitive answer :)
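The md5(md5(msg1) + md5(msg2) + ...) construction from the question could be sketched like this. Note the resulting digest is well defined and deterministic, but it is a *different* value from the digest of the whole file, so both sides of a comparison must use the same scheme (the function name and the idea of pages as byte strings are illustrative, not from any TIFF library):

```python
import hashlib

def checksum_of_checksums(pages):
    """Hash each page separately, then hash the concatenation of the
    per-page digests. `pages` is an iterable of bytes objects
    (hypothetical page payloads)."""
    outer = hashlib.md5()
    for page in pages:
        outer.update(hashlib.md5(page).digest())
    return outer.hexdigest()
```

Any change to any page changes that page's digest and therefore the outer digest, so as an integrity check it behaves much like hashing the whole stream, as long as both ends agree on the page boundaries.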

wuub
Thank you for the link - but how far off would I be if I calculate the checksum of the page checksums, instead of the checksum of the whole multipage document?
Otávio Décio