views: 680

answers: 8
We have a very old, unsupported program which copies files across SMB shares. It has a checksum algorithm to determine if the file contents have changed before copying. The algorithm seems easily fooled -- we've just found an example where two files, identical except a single '1' changing to a '2', return the same checksum. Here's the algorithm:

unsigned long GetFileCheckSum(CString PathFilename)
{
        FILE* File;
        unsigned long CheckSum = 0;
        unsigned long Data = 0;
        unsigned long Count = 0;

        if ((File = fopen(PathFilename, "rb")) != NULL)
        {
                while (fread(&Data, 1, sizeof(unsigned long), File) != FALSE)
                {
                        CheckSum ^= Data + ++Count;
                        Data = 0;
                }
                fclose(File);
        }
        return CheckSum;
}

I'm not much of a programmer (I am a sysadmin) but I know an XOR-based checksum is going to be pretty crude. What're the chances of this algorithm returning the same checksum for two files of the same size with different contents? (I'm not expecting an exact answer, "remote" or "quite likely" is fine.)

How could it be improved without a huge performance hit?

Lastly, what's going on with the fread()? I had a quick scan of the documentation but I couldn't figure it out. Is Data being set to each byte of the file in turn? Edit: okay, so it's reading the file into unsigned long (let's assume a 32-bit OS here) chunks. What does each chunk contain? If the contents of the file are abcd, what is the value of Data on the first pass? Is it (in Perl):

(ord('a') << 24) & (ord('b') << 16) & (ord('c') << 8) & ord('d')
+2  A: 

You could easily improve the algorithm by using a formula like this one:

Checksum = (Checksum * a + Data * b) + c;

If a, b and c are large primes, this should return good results. After this, rotating (not shifting!) the bits of checksum will further improve it a bit.

Using primes, this is a similar algorithm to that used for Linear congruential generators - it guarantees long periods and good distribution.
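
A rough sketch of what that could look like dropped into the original read loop (the primes here are arbitrary picks, not canonical values, and the rotate assumes unsigned long is 32 bits):

const unsigned long a = 2654435761UL, b = 16777619UL, c = 104729UL;  /* arbitrary large primes */
while (fread(&Data, 1, sizeof(unsigned long), File) != 0)
{
        CheckSum = CheckSum * a + Data * b + c;
        CheckSum = (CheckSum << 13) | (CheckSum >> 19);  /* rotate left 13 bits (assumes 32-bit unsigned long), so no bits are lost */
        Data = 0;
}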

schnaader
I am not sure how that helps the distribution! It helps in hardening against malicious attacks.
Martin York
Assuming the files are a lot of ASCII text, this will ensure that you're not always XORing about 5 bytes of variance together, and will scatter the entropy through the checksum.
Paul Fisher
A: 

It seems like your algorithm makes no effort to deal with files that are not an exact multiple of 4 bytes in size. The return value of fread is not a boolean but the number of bytes actually read, which will differ from 4 in the case of an EOF or if an error occurred. You check for neither, but simply assume that if it didn't return 0, you have 4 valid bytes in 'data' with which to calculate your hash.

If you really want to use a hash, I'd recommend several things. First, use a simple cryptographic hash like MD5, not CRC32. CRC32 is decent for checking data validity, but for spanning a file system and ensuring no collisions, it's not as great a tool because of the birthday paradox mentioned in the comments elsewhere. Second, don't write the function yourself. Find an existing implementation. Finally, consider simply using rsync to replicate files instead of rolling your own solution.
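
As one example of an existing implementation (a sketch only, assuming OpenSSL is available -- link with -lcrypto), hashing a whole file looks roughly like this:

#include <stdio.h>
#include <openssl/md5.h>

/* Sketch only: MD5 of a file using OpenSSL's routines; error handling kept minimal. */
int GetFileMD5(const char* path, unsigned char digest[MD5_DIGEST_LENGTH])
{
        FILE* f = fopen(path, "rb");
        if (f == NULL)
                return 0;

        MD5_CTX ctx;
        MD5_Init(&ctx);

        unsigned char buf[64 * 1024];   /* read in big chunks, not 4 bytes at a time */
        size_t n;
        while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
                MD5_Update(&ctx, buf, n);

        fclose(f);
        MD5_Final(digest, &ctx);        /* fills in the 16-byte digest */
        return 1;
}

Comparing the 16-byte digests of the source and destination copies then takes the place of comparing the unsigned long checksums.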

Jherico
I think (assuming no errors and a file length > sizeof(long)) the hash will return consistent results as the last bits of the last read will consistently get held over from the last iteration.
BCS
That's relying on 2 flaws in the code to ensure proper functionality. If someone added code to reset 'data' to 0 between each loop, it would also produce consistent results, but now any previously stored values for CRC's are incorrect.
Jherico
Actually, Martin's deleted answer was on the right track. In this application, you are trying to detect whether two specific files are the same, not whether one file matches any in a collection. So, the birthday problem is not applicable.
erickson
Collisions across files in a filesystem aren't important -- all we need to know is, "has file x changed?" If yes, copy it, if no, don't. We can't use rsync for various reasons.
markdrayton
'Can't use rsync for various reasons' is a little odd since it's supported on pretty much every platform that also supports Samba or Windows shares, but OK. My comment about implementation and use of fread() stands.
Jherico
@BCS: Actually, while it may be correct that the current implementation is consistent, that's only true if fread() doesn't modify any bytes in the output pointer other than the ones it reads. Granted, the most obvious implementation will probably do that, but fread() doesn't make any such promise.
Jherico
@Jherico: we can't use rsync because we're dealing with a 100Gb+ filesystem with millions of files. When a file's changed it needs to be propagated to many other servers (100+) quickly, so we trigger the replication of the specific file at the same time as the change is made. Sometimes unchanged files are pushed too, hence the checksumming. Rsync would probably choke with > 100Gb and would definitely take an unacceptably long time to propagate changes. Our solution is inelegant and could be greatly improved but for the moment it's what we're stuck with.
markdrayton
Even if you're dealing with a list of 'potentially changed' files, you can feed that list to rsync and tell rsync to use checksumming to determine if the file has changed. You're still re-inventing the wheel.
Jherico
A: 

The fread bit is reading in the file one chunk at a time. Each chunk is the size of a long (in C this is not a well-defined size, but you can assume 32 or 64 bits). Depending on how it gets buffered, this might not be too bad. OTOH, reading a larger chunk into an array and looping over it might be a lot faster.
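
A rough sketch of that buffered variant, keeping the original mixing step so the resulting checksums stay the same (the 16 KB buffer size is an arbitrary choice):

#include <stdio.h>
#include <string.h>

/* Sketch only: same checksum as the original, but reading the file in large
   chunks instead of 4 bytes at a time. */
unsigned long GetFileCheckSumBuffered(const char* PathFilename)
{
        unsigned long CheckSum = 0;
        unsigned long Count = 0;
        unsigned long Buf[4096];            /* 16 KB per read if unsigned long is 4 bytes */
        size_t Bytes;

        FILE* File = fopen(PathFilename, "rb");
        if (File == NULL)
                return CheckSum;

        memset(Buf, 0, sizeof(Buf));        /* zero-fill so a short final word is padded like the original */
        while ((Bytes = fread(Buf, 1, sizeof(Buf), File)) > 0)
        {
                size_t Words = (Bytes + sizeof(unsigned long) - 1) / sizeof(unsigned long);
                for (size_t i = 0; i < Words; ++i)
                        CheckSum ^= Buf[i] + ++Count;
                memset(Buf, 0, sizeof(Buf));
        }
        fclose(File);
        return CheckSum;
}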

BCS
+6  A: 

MD5 is commonly used to verify the integrity of transferred files. Source code is readily available in C++. It is widely considered to be a fast and accurate algorithm.

See also http://stackoverflow.com/questions/122982/robust-and-fast-checksum-algorithm

Robert Harvey
Side note: MD5 is only good for verifying file integrity from a trusted source. It is possible to create two files with the same MD5 checksum, provided it is done in advance and both are made at the same time. But it is prohibitively difficult to make a file with the same MD5 as another.
rlbond
If you don't care about crypto strength, MD4 is a little simpler and faster, and CRC-32 is noticeably simpler and faster, than MD5. None of them will equal the speed of OP's (broken) "checksum" though.
ephemient
I think MD4 and CRC32 will equal the speed, since they're all likely I/O bound. With modern CPUs even MD5 might be I/O bound.
MSalters
Ah, OP's doing 4-byte reads, that will almost definitely be I/O bound. Well, if you're working in memory only, I think my ranking holds though.
ephemient
A: 

Even "expensive" cryptographic hash functions usually require multiple iterations to take significant amounts of time. Although no longer recommended for cryptographic purposes, where users would deliberately try to create collisions, functions like SHA1 and MD5 are widely available and suitable for this purpose.

If a smaller hash value is needed, CRC is alright, but not great. An n-bit CRC will fail to detect a small fraction (roughly 1 in 2^n) of changes that are longer than n bits. For example, suppose just a single dollar amount in a file is changed, from $12,345 to $34,567. A 32-bit CRC might miss that change.

Truncating the result of a longer cryptographic hash will detect changes more reliably than a CRC.

erickson
A: 
{
   CheckSum ^= Data + ++Count;
   Data = 0;
}

I don't think "++Count" does much work. The code is equivalent to

{
  CheckSum ^= Data;
}

XORing a sequence of bytes is not enough, especially with text files.

I suggest using a hash function.

Nick D
Well, ++Count does do some work; for example, it prevents trivial collisions like chk(ABCD) = chk(DCBA), which would occur when using plain ^= Data (A, B, C, D are meant to be unsigned longs here).
schnaader
Ok, but note that it affects only the 4th byte in each round, the 3rd every 256 loops, the 2nd every 65536, etc.
Nick D
+4  A: 

I'd suggest you take a look at Fletcher's checksum, specifically fletcher-32, which ought to be fairly fast, and detect various things the current XOR chain would not.
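
For reference, a typical Fletcher-32 implementation over an in-memory buffer looks roughly like this (a sketch only; reading the file and zero-padding an odd trailing byte are left to the caller):

#include <stddef.h>
#include <stdint.h>

/* Sketch only: Fletcher-32 over a buffer of 16-bit words. */
uint32_t Fletcher32(const uint16_t* data, size_t words)
{
        uint32_t sum1 = 0xffff, sum2 = 0xffff;

        while (words) {
                size_t block = words > 359 ? 359 : words;  /* largest block before the sums can overflow */
                words -= block;
                do {
                        sum1 += *data++;
                        sum2 += sum1;
                } while (--block);
                sum1 = (sum1 & 0xffff) + (sum1 >> 16);     /* reduce modulo 65535 */
                sum2 = (sum2 & 0xffff) + (sum2 >> 16);
        }
        /* one more reduction so each sum fits in 16 bits */
        sum1 = (sum1 & 0xffff) + (sum1 >> 16);
        sum2 = (sum2 & 0xffff) + (sum2 >> 16);
        return (sum2 << 16) | sum1;
}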

Hasturkun
A: 

SHA-1 and (more recently) SHA-2 provide excellent hashing functions and are, I believe, slowly supplanting MD5 thanks to better hashing properties. All of them (MD2, SHA, etc.) have efficient implementations and return a hash of a buffer that is several characters long (although always a fixed length), which is considerably more reliable than reducing a hash to an integer. If I had my druthers, I'd use SHA-2. Follow this link for libraries that implement SHA checksums.

If you don't want to compile in those libraries, Linux (and probably Cygwin) has the following executables: md5sum, sha1sum, sha224sum, sha256sum, sha384sum, sha512sum. You can give them your file and they will print out the checksum as a hex string. You can use popen to execute those programs -- with something like this:

const int maxBuf = 1024;
char buf[maxBuf];
FILE* f = popen("sha224sum myfile", "r");   /* "r" so we can read the command's output */
size_t bytesRead = fread(buf, 1, maxBuf - 1, f);
buf[bytesRead] = '\0';                      /* buf now starts with the hex digest */
pclose(f);                                  /* popen'd streams are closed with pclose, not fclose */

Obviously this will run quite a lot slower, but it makes for a useful first pass. If speed is an issue, given that file hashing operations like this are I/O bound (memory and disk access will be your bottlenecks), I'd expect all of these algorithms to run about as fast as one that produces an unsigned int. Perl and Python also come with implementations of MD5, SHA1 and SHA2, and will probably run about as fast as C/C++.

SHA-1 has better cryptographic properties than MD-5. That's irrelevant here.
MSalters
Maybe. Depends on the application. There's a discussion of the (cryptographic) problems with MD5 here: http://en.wikipedia.org/wiki/MD5 (It also says that SHA1 is broken for cryptographic purposes, BTW.)