I am looking for a checksum algorithm where, for a large block of data, the checksum is equal to the sum of the checksums of all the smaller component blocks. Most of what I have found points to RFCs 1624/1141, which do provide this functionality. Does anyone have any experience with these checksumming techniques or a similar one?

+6  A: 

I have only used Adler/Fletcher checksums, which work as you describe.

There is a nice comparison of crypto++ hash/checksum implementations here.
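For reference, a minimal Adler-32 sketch might look like this (modulus 65521, none of zlib's optimizations); combining per-block values also needs the block lengths, which zlib exposes as adler32_combine:

#include <stdint.h>
#include <stddef.h>

#define MOD_ADLER 65521u   /* largest prime below 2^16 */

/* Minimal Adler-32 sketch: two running sums packed into one 32-bit value. */
uint32_t adler32(const unsigned char *data, size_t len)
{
    uint32_t a = 1, b = 0;
    for (size_t i = 0; i < len; i++) {
        a = (a + data[i]) % MOD_ADLER;
        b = (b + a) % MOD_ADLER;
    }
    return (b << 16) | a;
}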

Rob Elliott
+4  A: 

To answer Amigable Clark Kent's bounty question: for file identity purposes you probably want a cryptographic hash function, which tries to guarantee that any two given files have an extremely low probability of producing the same value. A checksum, in contrast, is generally used only for error detection and may well produce the same value for two very different files.

Many cryptographic hash functions, such as MD5 and SHA-1, use the Merkle–Damgård construction: a compression function combines each block of data with the fixed-size value produced from the previous block (or with an initialization vector for the first block) to yield a new fixed-size value. Thus, they are able to work in a streaming mode, incrementally computing the hash as they go along.
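A rough sketch of that streaming shape might look like this (the compress() mixing, the block size, and the four-word state are placeholders, not MD5's or SHA-1's real definitions):

#include <stdint.h>
#include <stddef.h>

#define BLOCK_SIZE 64   /* placeholder; MD5 and SHA-1 also use 64-byte blocks */

/* Placeholder compression function: folds one block into the fixed-size
   chaining state. A real hash defines this mixing very carefully. */
static void compress(uint32_t state[4], const unsigned char block[BLOCK_SIZE])
{
    for (int i = 0; i < BLOCK_SIZE; i++)
        state[i % 4] = (state[i % 4] * 31u) ^ block[i];   /* toy mixing only */
}

/* Merkle-Damgard skeleton: the state is chained from block to block, so the
   hash can be computed incrementally without holding the whole message. */
void hash_stream(uint32_t state[4], const unsigned char *data, size_t len)
{
    while (len >= BLOCK_SIZE) {
        compress(state, data);
        data += BLOCK_SIZE;
        len  -= BLOCK_SIZE;
    }
    /* a real implementation buffers the tail and applies length padding here */
}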

Brian Campbell
+5  A: 

If it's just a matter of quickly combining the checksums of the smaller blocks to get the checksum of the larger message (not necessarily by a plain summation), you can do this with a CRC-type (or similar) algorithm.

The CRC-32 algorithm is as simple as this:

#include <stdint.h>

/* Shift one message bit into the 32-bit CRC register and reduce modulo
   the CRC-32 generator polynomial (0x04C11DB7 is G without its x^32 term). */
uint32_t update(uint32_t state, unsigned bit)
{
    if (((state >> 31) ^ bit) & 1) state = (state << 1) ^ 0x04C11DB7;
    else                           state = (state << 1);
    return state;
}
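To checksum a whole buffer you could push its bits through update(), most significant bit first, roughly like this (crc_register is just a name for this sketch; there is no initial value, bit reflection or final XOR here, so it won't match the usual zip/PNG CRC-32 values, but it is exactly the register the math below talks about):

#include <stddef.h>

/* Feed a buffer through update() above, MSB first within each byte. */
uint32_t crc_register(const unsigned char *data, size_t len)
{
    uint32_t state = 0;
    for (size_t i = 0; i < len; i++)
        for (int bit = 7; bit >= 0; bit--)
            state = update(state, (data[i] >> bit) & 1);
    return state;
}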

Mathematically, the state represents a polynomial over the field GF(2) that is always kept reduced modulo the generator polynomial. Given a new bit b, the old state is transformed into the new state like this:

state --> (state * x^1 + b * x^32) mod G

where G is the generator polynomial and addition is done in GF(2) (XOR). This checksum is linear, in the sense that you can write the message M as a sum (XOR) of messages A, B, C, ... like this

  10110010 00000000 00000000 = A =    a     00000000 00000000
  00000000 10010001 00000000 = B = 00000000    b     00000000
  00000000 00000000 11000101 = C = 00000000 00000000    c
-------------------------------------------------------------
= 10110010 10010001 11000101 = M =    a        b        c

with the following properties

         M  =          A  +          B  +          C
checksum(M) = checksum(A) + checksum(B) + checksum(C)

Again, the + here means addition in GF(2), which you can implement with a bitwise XOR.
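Continuing the crc_register() sketch from above, the property can be checked directly on the example bytes:

#include <assert.h>

int main(void)
{
    const unsigned char M[3] = { 0xB2, 0x91, 0xC5 };  /* 10110010 10010001 11000101 */
    const unsigned char A[3] = { 0xB2, 0x00, 0x00 };
    const unsigned char B[3] = { 0x00, 0x91, 0x00 };
    const unsigned char C[3] = { 0x00, 0x00, 0xC5 };

    uint32_t lhs = crc_register(M, 3);
    uint32_t rhs = crc_register(A, 3) ^ crc_register(B, 3) ^ crc_register(C, 3);
    assert(lhs == rhs);   /* checksum(M) == checksum(A) + checksum(B) + checksum(C) in GF(2) */
    return 0;
}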

Finally, it's possible to compute checksum(B) from checksum(b) and the position of the sub-block b within B. The easy part is the leading zeros: they don't affect the checksum at all, so checksum(0000xxxx) is the same as checksum(xxxx). Computing the checksum of a message that is zero-padded to the right (trailing zeros), given the checksum of the non-padded message, is a bit more complicated, but not that complicated:

zero_pad(old_check_sum, number_of_zeros)
  := ( old_check_sum *  x^{number_of_zeros}        ) mod G
   = ( old_check_sum * (x^{number_of_zeros} mod G) ) mod G

So, getting the checksum of a zero-padded message is just a matter of multiplying the "checksum polynomial" of the non-padded message by another polynomial, (x^{number_of_zeros} mod G), which depends only on the number of zeros you want to append. You could precompute these factors in a table, or use the square-and-multiply algorithm to compute the power quickly.
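A sketch of that might look like this (gf2_mulmod and xpow_mod are just names for this sketch; bit k of each 32-bit word is the coefficient of x^k, matching the register above):

#include <stdint.h>

/* Multiply a(x) * b(x) modulo G(x) over GF(2), Horner-style. */
static uint32_t gf2_mulmod(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    for (int i = 31; i >= 0; i--) {
        /* result = result * x mod G */
        result = (result & 0x80000000u) ? (result << 1) ^ 0x04C11DB7u
                                        : (result << 1);
        if ((b >> i) & 1)
            result ^= a;
    }
    return result;
}

/* x^n mod G by square-and-multiply; n is the number of zero bits appended. */
static uint32_t xpow_mod(uint64_t n)
{
    uint32_t result = 1;   /* the polynomial 1 */
    uint32_t base   = 2;   /* the polynomial x */
    while (n) {
        if (n & 1) result = gf2_mulmod(result, base);
        base = gf2_mulmod(base, base);
        n >>= 1;
    }
    return result;
}

/* zero_pad from the formula above: append number_of_zero_bits zero bits. */
uint32_t zero_pad(uint32_t old_check_sum, uint64_t number_of_zero_bits)
{
    return gf2_mulmod(old_check_sum, xpow_mod(number_of_zero_bits));
}

For the 3-byte example above, checksum(M) is then zero_pad(checksum(a), 16) XOR zero_pad(checksum(b), 8) XOR checksum(c).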

Suggested reading: Painless Guide to CRC Error Detection Algorithms

sellibitze