I am looking for a checksum algorithm where for a large block of data the checksum is equal to the sum of checksums from all the smaller component blocks. Most of what I have found is from RFCs 1624/1141 which do provide this functionality. Does anyone have any experience with these checksumming techniques or a similar one?
I have only used Adler/Fletcher checksums which work as you describe.
There is a nice comparison of crypto++ hash/checksum implementations here.
To answer Amigable Clark Kent's bounty question, for file identity purposes you probably want a cryptographic hash function, which tries to guarantee that any two given files have an extremely low probability of producing the same value, as opposed to a checksum which is generally used for error detection only and may provide the same value for two very different files.
Many cryptographic hash functions, such as MD5 and SHA-1, use the Merkle–Damgård construction, in which there is a computation to compress a block of data into a fixed size, and then combine that with a fixed size value from the previous block (or an initialization vector for the first block). Thus, they are able to work in a streaming mode, incrementally computing as they go along.
If it's just a matter of quickly combining the checksums of the smaller blocks to get to the checksums of the larger message (not necessarily by a plain summation) you can do this with a CRC-type (or similar) algorithm.
The CRC-32 algorithm is as simple as this:
uint32_t update(uint32_t state, unsigned bit)
{
if (((state >> 31) ^ bit) & 1) state = (state << 1) ^ 0x04C11DB7;
else state = (state << 1);
return state;
}
Mathematically, the state represents a polynomial over the field GF2 that is always reduced modulo the generator polynomial. Given a new bit b
the old state is transformed into the new state like this
state --> (state * x^1 + b * x^32) mod G
where G is the generator polynomial and addition is done in GF2 (xor). This checksum is linear in the sense that you can write the message M
as a sum (xor) of messages A,B,C,... like this
10110010 00000000 00000000 = A = a 00000000 00000000
00000000 10010001 00000000 = B = 00000000 b 00000000
00000000 00000000 11000101 = C = 00000000 00000000 c
-------------------------------------------------------------
= 10110010 10010001 11000101 = M = a b c
with the following properties
M = A + B + C
checksum(M) = checksum(A) + checksum(B) + checksum(C)
Again, I mean the +
in GF2 which you can implement with a binary XOR.
Finally, it's possible to compute checksum(B)
based on checksum(b)
and the position of the subblock b
relative to B
. The simple part is leading zeros. Leading zeros don't affect the checksum at all. So checksum(0000xxxx)
is the same as checksum(xxxx)
. If you want to compute the checksum of a zero-padded (to the right -> trailing zeros) message given the checksum of the non-padded message it is a bit more complicated. But not that complicated:
zero_pad(old_check_sum, number_of_zeros)
:= ( old_check_sum * x^{number_of_zeros} ) mod G
= ( old_check_sum * (x^{number_of_zeros} mod G) ) mod G
So, getting the checksum of a zero-padded message is just a matter of multiplying the "checksum polynomial" of the non-padded message with some other polynomial (x^{number_of_zeros} mod G
) that only depends on the number of zeros you want to add. You could precompute this in a table or use the square-and-multiply algorithm to quickly compute this power.
Suggested reading: Painless Guide to CRC Error Detection Algorithms