views: 680

answers: 8
We have a very old, unsupported program which copies files across SMB shares. It has a checksum algorithm to determine if the file contents have changed before copying. The algorithm seems easily fooled -- we've just found an example where two files, identical except a single '1' changing to a '2', return the same checksum. Here's the algorithm:

unsigned long GetFileCheckSum(CString PathFilename)
{
        FILE* File;
        unsigned long CheckSum = 0;
        unsigned long Data = 0;
        unsigned long Count = 0;

        if ((File = fopen(PathFilename, "rb")) != NULL)
        {
                while (fread(&Data, 1, sizeof(unsigned long), File) != FALSE)
                {
                        CheckSum ^= Data + ++Count;
                        Data = 0;
                }
                fclose(File);
        }
        return CheckSum;
}

I'm not much of a programmer (I am a sysadmin) but I know an XOR-based checksum is going to be pretty crude. What're the chances of this algorithm returning the same checksum for two files of the same size with different contents? (I'm not expecting an exact answer, "remote" or "quite likely" is fine.)

How could it be improved without a huge performance hit?

Lastly, what's going on with the fread()? I had a quick scan of the documentation but I couldn't figure it out. Is Data being set to each byte of the file in turn? Edit: okay, so it's reading the file into unsigned long (let's assume a 32-bit OS here) chunks. What does each chunk contain? If the contents of the file are abcd, what is the value of Data on the first pass? Is it (in Perl):

(ord('a') << 24) & (ord('b') << 16) & (ord('c') << 8) & ord('d')
+2  A: 

You could easily improve the algorithm by using a formula like this one:

Checksum = (Checksum * a + Data * b) + c;

If a, b and c are large primes, this should return good results. After this, rotating (not shifting!) the bits of checksum will further improve it a bit.

Using primes, this is a similar algorithm to that used for Linear congruential generators - it guarantees long periods and good distribution.
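
A rough sketch of what that could look like dropped into the original read loop (the primes here are arbitrary picks, not canonical values, and the rotate assumes unsigned long is 32 bits):

const unsigned long a = 2654435761UL, b = 16777619UL, c = 104729UL;  /* arbitrary large primes */
while (fread(&Data, 1, sizeof(unsigned long), File) != 0)
{
        CheckSum = CheckSum * a + Data * b + c;
        CheckSum = (CheckSum << 13) | (CheckSum >> 19);  /* rotate left 13 bits (assumes 32-bit unsigned long), so no bits are lost */
        Data = 0;
}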

schnaader
I am not sure how that helps the distribution! It helps in hardening against malicious attacks.
Martin York
Assuming the files are a lot of ASCII text, this will ensure that you're not always XORing about 5 bytes of variance together, and will scatter the entropy through the checksum.
Paul Fisher
A: 

It seems like your algorithm makes no effort to deal with files that are not an exact multiple of 4 bytes in size. The return value of fread is not a boolean but the number of bytes actually read, which will differ from 4 in the case of an EOF or if an error occurred. You check for neither, but simply assume that if it didn't return 0, you have 4 valid bytes in 'data' with which to calculate your hash.

If you really want to use a hash, I'd recommend several things. First, use a simple cryptographic hash like MD5, not CRC32. CRC32 is decent for checking data validity, but for spanning a file system and ensuring no collisions, it's not as great a tool because of the birthday paradox mentioned in the comments elsewhere. Second, don't write the function yourself. Find an existing implementation. Finally, consider simply using rsync to replicate files instead of rolling your own solution.
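
As one example of an existing implementation (a sketch only, assuming OpenSSL is available -- link with -lcrypto), hashing a whole file looks roughly like this:

#include <stdio.h>
#include <openssl/md5.h>

/* Sketch only: MD5 of a file using OpenSSL's routines; error handling kept minimal. */
int GetFileMD5(const char* path, unsigned char digest[MD5_DIGEST_LENGTH])
{
        FILE* f = fopen(path, "rb");
        if (f == NULL)
                return 0;

        MD5_CTX ctx;
        MD5_Init(&ctx);

        unsigned char buf[64 * 1024];   /* read in big chunks, not 4 bytes at a time */
        size_t n;
        while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
                MD5_Update(&ctx, buf, n);

        fclose(f);
        MD5_Final(digest, &ctx);        /* fills in the 16-byte digest */
        return 1;
}

Comparing the 16-byte digests of the source and destination copies then takes the place of comparing the unsigned long checksums.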

Jherico
I think (assuming no errors and a file length > sizeof(long)) the hash will return consistent results as the last bits of the last read will consistently get held over from the last iteration.
BCS
That's relying on 2 flaws in the code to ensure proper functionality. If someone added code to reset 'data' to 0 between each loop, it would also produce consistent results, but now any previously stored values for CRC's are incorrect.
Jherico
Actually, Martin's deleted answer was on the right track. In this application, you are trying to detect whether two specific files are the same, not whether one file matches any in a collection. So, the birthday problem is not applicable.
erickson
Collisions across files in a filesystem aren't important -- all we need to know is, "has file x changed?" If yes, copy it, if no, don't. We can't use rsync for various reasons.
markdrayton
'Can't use rsync for various reasons' is a little odd since it's supported on pretty much every platform that also supports Samba or Windows shares, but OK. My comment about implementation and use of fread() stands.
Jherico
@BCS: Actually, while it may be correct that the current implementation is consistent, that's only true if fread() doesn't modify any bytes in the output pointer other than the ones it reads. Granted, the most obvious implementation will probably do that, but fread() doesn't make any such promise.
Jherico
@Jherico: we can't use rsync because we're dealing with a 100Gb+ filesystem with millions of files. When a file's changed it needs to be propagated to many other servers (100+) quickly, so we trigger the replication of the specific file at the same time as the change is made. Sometimes unchanged files are pushed too, hence the checksumming. Rsync would probably choke with > 100Gb and would definitely take an unacceptably long time to propagate changes. Our solution is inelegant and could be greatly improved but for the moment it's what we're stuck with.
markdrayton
Even if you're dealing with a list of 'potentially changed' files, you can feed that list to rsync and tell rsync to use checksumming to determine if the file has changed. You're still re-inventing the wheel.
Jherico
A: 

The fread bit is reading in the file one chunk at a time. Each chunk is the size of a long (in C this is not a well-defined size, but you can assume 32 or 64 bits). Depending on how it gets buffered, this might not be too bad. OTOH, reading a larger chunk into an array and looping over it might be a lot faster.
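
A rough sketch of that buffered variant, keeping the original mixing step so the resulting checksums stay the same (the 16 KB buffer size is an arbitrary choice):

#include <stdio.h>
#include <string.h>

/* Sketch only: same checksum as the original, but reading the file in large
   chunks instead of 4 bytes at a time. */
unsigned long GetFileCheckSumBuffered(const char* PathFilename)
{
        unsigned long CheckSum = 0;
        unsigned long Count = 0;
        unsigned long Buf[4096];            /* 16 KB per read if unsigned long is 4 bytes */
        size_t Bytes;

        FILE* File = fopen(PathFilename, "rb");
        if (File == NULL)
                return CheckSum;

        memset(Buf, 0, sizeof(Buf));        /* zero-fill so a short final word is padded like the original */
        while ((Bytes = fread(Buf, 1, sizeof(Buf), File)) > 0)
        {
                size_t Words = (Bytes + sizeof(unsigned long) - 1) / sizeof(unsigned long);
                for (size_t i = 0; i < Words; ++i)
                        CheckSum ^= Buf[i] + ++Count;
                memset(Buf, 0, sizeof(Buf));
        }
        fclose(File);
        return CheckSum;
}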

BCS
+6  A: 

MD5 is commonly used to verify the integrity of transferred files. Source code is readily available in C++. It is widely considered to be a fast and accurate algorithm.

See also http://stackoverflow.com/questions/122982/robust-and-fast-checksum-algorithm

Robert Harvey
Side note: MD5 is only good for verifying file integrity from a trusted source. It is possible to create two files with the same MD5 checksum, provided it is done in advance and both are made at the same time. But it is prohibitively difficult to make a file with the same MD5 as another.
rlbond
If you don't care about crypto strength, MD4 is a little simpler and faster, and CRC-32 is noticeably simpler and faster, than MD5. None of them will equal the speed of OP's (broken) "checksum" though.
ephemient
I think MD4 and CRC32 will equal the speed, since they're all likely I/O bound. With modern CPUs even MD5 might be I/O bound.
MSalters
Ah, OP's doing 4-byte reads, that will almost definitely be I/O bound. Well, if you're working in memory only, I think my ranking holds though.
ephemient
A: 

Even "expensive" cryptographic hash functions usually require multiple iterations to take significant amounts of time. Although no longer recommended for cryptographic purposes, where users would deliberately try to create collisions, functions like SHA1 and MD5 are widely available and suitable for this purpose.

If a smaller hash value is needed, CRC is alright, but not great. An n-bit CRC will fail to detect a small fraction (roughly 1 in 2^n) of changes that are longer than n bits. For example, suppose just a single dollar amount in a file is changed, from $12,345 to $34,567. A 32-bit CRC might miss that change.

Truncating the result of a longer cryptographic hash will detect changes more reliably than a CRC.

erickson
A: 
{
   CheckSum ^= Data + ++Count;
   Data = 0;
}

I don't think "++Count" does much work. The code is equivalent to

{
  CheckSum ^= Data;
}

XORing a sequence of bytes is not enough, especially with text files.

I suggest using a hash function.

Nick D
Well, ++Count does do some work; for example, it prevents trivial collisions like chk(ABCD) = chk(DCBA), which would occur when using plain ^= Data (A, B, C, D are meant to be unsigned longs here).
schnaader
Ok, but note that it affects only the 4th byte in each round, the 3rd every 256 loops, the 2nd every 65536, etc.
Nick D
+4  A: 

I'd suggest you take a look at Fletcher's checksum, specifically fletcher-32, which ought to be fairly fast, and detect various things the current XOR chain would not.
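
For reference, a typical Fletcher-32 implementation over an in-memory buffer looks roughly like this (a sketch only; reading the file and zero-padding an odd trailing byte are left to the caller):

#include <stddef.h>
#include <stdint.h>

/* Sketch only: Fletcher-32 over a buffer of 16-bit words. */
uint32_t Fletcher32(const uint16_t* data, size_t words)
{
        uint32_t sum1 = 0xffff, sum2 = 0xffff;

        while (words) {
                size_t block = words > 359 ? 359 : words;  /* largest block before the sums can overflow */
                words -= block;
                do {
                        sum1 += *data++;
                        sum2 += sum1;
                } while (--block);
                sum1 = (sum1 & 0xffff) + (sum1 >> 16);     /* reduce modulo 65535 */
                sum2 = (sum2 & 0xffff) + (sum2 >> 16);
        }
        /* one more reduction so each sum fits in 16 bits */
        sum1 = (sum1 & 0xffff) + (sum1 >> 16);
        sum2 = (sum2 & 0xffff) + (sum2 >> 16);
        return (sum2 << 16) | sum1;
}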

Hasturkun
A: 

SHA-1 and (more recently) SHA-2 provide excellent hashing functions and are, I believe, slowly supplanting MD5 thanks to better hashing properties. All of them (MD2, SHA, etc.) have efficient implementations and return a hash of a buffer that is several characters long (although always a fixed length), which is considerably more reliable than reducing a hash to an integer. If I had my druthers, I'd use SHA-2. Follow this link for libraries that implement SHA checksums.

If you don't want to compile in those libraries, Linux (and probably Cygwin) has the following executables: md5sum, sha1sum, sha224sum, sha256sum, sha384sum, sha512sum. You can give them your file and they will print out the checksum as a hex string. You can use popen to execute those programs -- with something like this:

const int maxBuf = 1024;
char buf[maxBuf];
FILE* f = popen("sha224sum myfile", "r");   /* "r" so we can read the command's output */
size_t bytesRead = fread(buf, 1, maxBuf - 1, f);
buf[bytesRead] = '\0';                      /* buf now starts with the hex digest */
pclose(f);                                  /* popen'd streams are closed with pclose, not fclose */

Obviously this will run quite a lot slower, but it makes for a useful first pass. If speed is an issue, given that file hashing operations like this are I/O bound (memory and disk access will be your bottlenecks), I'd expect all of these algorithms to run about as fast as one that produces an unsigned int. Perl and Python also come with implementations of MD5, SHA1 and SHA2, and will probably run about as fast as C/C++.

SHA-1 has better cryptographic properties than MD-5. That's irrelevant here.
MSalters
Maybe. Depends on the application. There's a discussion of the (cryptographic) problems with MD5 here: http://en.wikipedia.org/wiki/MD5 (It also says that SHA1 is broken for cryptographic purposes, BTW.)