We have a very old, unsupported program which copies files across SMB shares. It has a checksum algorithm to determine if the file contents have changed before copying. The algorithm seems easily fooled -- we've just found an example where two files, identical except for a single '1' changed to a '2', return the same checksum. Here's the algorithm:
unsigned long GetFileCheckSum(CString PathFilename)
{
    FILE* File;
    unsigned long CheckSum = 0;
    unsigned long Data = 0;
    unsigned long Count = 0;

    if ((File = fopen(PathFilename, "rb")) != NULL)
    {
        while (fread(&Data, 1, sizeof(unsigned long), File) != FALSE)
        {
            CheckSum ^= Data + ++Count;
            Data = 0;
        }
        fclose(File);
    }
    return CheckSum;
}
I'm not much of a programmer (I am a sysadmin), but I know an XOR-based checksum is going to be pretty crude. What are the chances of this algorithm returning the same checksum for two files of the same size with different contents? (I'm not expecting an exact answer; "remote" or "quite likely" is fine.)
How could it be improved without a huge performance hit?
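For example, would something as simple as a byte-wise 32-bit FNV-1a be a sensible direction, or is a proper CRC or MD5 worth the extra cost? The sketch below is just my own guess at what that might look like -- the GetFileFnv1a name, the const char* parameter (the original takes a CString) and the 4 KB buffer size are my invention, and I haven't tested it against our data; the two magic numbers are the standard FNV-1a offset basis and prime.

#include <stdio.h>
#include <stdint.h>

/* Byte-wise 32-bit FNV-1a over a file (sketch only, not from the
   existing program). */
uint32_t GetFileFnv1a(const char *PathFilename)
{
    FILE *File;
    uint32_t Hash = 2166136261u;   /* FNV-1a offset basis */
    unsigned char Buffer[4096];
    size_t Read, i;

    if ((File = fopen(PathFilename, "rb")) != NULL)
    {
        while ((Read = fread(Buffer, 1, sizeof Buffer, File)) > 0)
        {
            for (i = 0; i < Read; i++)
            {
                Hash ^= Buffer[i];   /* mix in each byte...            */
                Hash *= 16777619u;   /* ...then multiply by FNV prime  */
            }
        }
        fclose(File);
    }
    return Hash;
}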
Lastly, what's going on with the fread()? I had a quick scan of the documentation but I couldn't figure it out. Is Data being set to each byte of the file in turn?

Edit: okay, so it's reading the file in unsigned long-sized chunks (let's assume a 32-bit OS here). What does each chunk contain? If the contents of the file are abcd, what is the value of Data on the first pass? Is it (in Perl):

(ord('a') << 24) | (ord('b') << 16) | (ord('c') << 8) | ord('d')
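To check my guess I was going to run this little throwaway test (my own code, assuming a 4-byte unsigned long as on our 32-bit machines); it copies the bytes 'a','b','c','d' into an unsigned long the same way fread() would fill Data, and prints the result in hex:

#include <stdio.h>
#include <string.h>

int main(void)
{
    unsigned long Data = 0;

    /* Put the four bytes 'a','b','c','d' into Data in file order,
       as fread() would; the numeric value that results depends on
       the machine's byte order. */
    memcpy(&Data, "abcd", 4);
    printf("Data = 0x%08lx\n", Data);
    return 0;
}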