I was working with quality yesterday doing some formal testing. In their procedure they were verifying that all the files on the test machine were pulled from the release. The way they were verifying the files were the same was by checking the size and the date/time stamp Windows puts on them in Explorer. These happened to be off for another reason, which I was able to track down. But my question is this: is that a valid way to verify a file is the same? I would think not, and I started to argue, but I am the junior here so I thought I shouldn't push it too far. I wanted to argue that they should do a binary compare on the file to verify its contents are exact. In my experience, time/date stamps and size attributes don't always behave as expected. Any thoughts?

A: 

You should do a CRC check on each file... from the wiki:

Cyclic redundancy check, a type of hash function used to produce a checksum, in order to detect errors in transmission or storage.

It produces an almost unique value based on the contents of the file.
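A minimal sketch of such a check in Python, using the standard zlib module; the file paths are made up for illustration:

    import zlib

    def crc32_of_file(path, chunk_size=65536):
        """Compute the CRC-32 of a file, reading it in chunks."""
        crc = 0
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                crc = zlib.crc32(chunk, crc)
        return crc & 0xFFFFFFFF  # keep it an unsigned 32-bit value

    # Made-up paths: the release copy vs. the copy on the test machine.
    if crc32_of_file("release/app.dll") == crc32_of_file("testbox/app.dll"):
        print("CRC-32 values match")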

GateKiller
CRC-32 only has good Hamming distances for fairly small files (< 128K); above that size they do not have enough entropy to be used reliably for file comparison.
Epsilon
+1  A: 

I would do something like an md5sum hash on the files and compare that to the known hashes from the release. That will be more accurate than just date/time comparisons, and it is easier to automate.
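A rough sketch of how that could be automated in Python with hashlib; the manifest name and its format (one "digest  path" pair per line, as md5sum emits) are assumptions for illustration:

    import hashlib

    def md5_of_file(path, chunk_size=65536):
        """Return the hex MD5 digest of a file, reading it in chunks."""
        md5 = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                md5.update(chunk)
        return md5.hexdigest()

    def verify_against_manifest(manifest_path):
        """Check every file listed in a made-up 'digest  path' manifest."""
        with open(manifest_path) as manifest:
            for line in manifest:
                expected, _, filename = line.strip().partition("  ")
                status = "OK" if md5_of_file(filename) == expected else "MISMATCH"
                print(f"{status}: {filename}")

    verify_against_manifest("release.md5")  # hypothetical manifest name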

Ryan Ahearn
+1  A: 

The normal way is to compute a hash of the two files and compare them. MD5 and SHA-1 are typical hash algorithms. md5sum should be installed by default on most Unix-type machines, and Wikipedia's md5sum article has links to some Windows implementations.
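A minimal pairwise version in Python (the paths are placeholders, and hashlib.file_digest needs Python 3.11 or newer):

    import hashlib

    def sha1_of_file(path):
        # hashlib.file_digest requires Python 3.11+; on older versions,
        # read the file in chunks and feed a hashlib.sha1() object instead.
        with open(path, "rb") as f:
            return hashlib.file_digest(f, "sha1").hexdigest()

    # Placeholder paths for the two copies being compared.
    same = sha1_of_file("release/setup.exe") == sha1_of_file("testbox/setup.exe")
    print("identical" if same else "different")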

Kieron
+3  A: 

The only 100% way to figure out if two files are equal is to do a binary comparison of the two.

If you can live with the risk of false positives (i.e. two files which aren't 100% identical but which your code says are), then digest and checksum algorithms can be used to lessen the work, particularly if the files live on two different machines with less-than-optimal bandwidth between them, so that a binary comparison is infeasible.

The digest and checksum algorithms all have some chance of false positives, but the exact chance varies with the algorithm. The general rule is that the more cryptographically oriented it is, and the more bits it outputs, the lower the chance of a false positive.

Even the CRC-32 algorithm is fairly good to use, and it should be easy to find code examples on the internet that implement it.

If you only do a size/timestamp comparison, then I'm sorry to say that this is easy to circumvent and won't actually give you much certainty that the files are the same or different.

It depends, though: if you know that in your environment timestamps are preserved and only change when a file is modified, then you can use them; otherwise they hold no guarantee.
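For reference, Python's standard filecmp module illustrates both sides of this: its default shallow mode trusts the os.stat() signature (type, size, modification time), while shallow=False actually reads and compares the bytes. The paths below are placeholders:

    import filecmp

    a, b = "release/app.dll", "testbox/app.dll"  # placeholder paths

    # shallow=True (the default) compares only the os.stat() signature:
    # file type, size and modification time -- essentially the
    # size/timestamp check discussed above.
    quick = filecmp.cmp(a, b, shallow=True)

    # shallow=False reads both files and compares their contents.
    exact = filecmp.cmp(a, b, shallow=False)

    print(f"stat signature match: {quick}, byte-for-byte match: {exact}")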

Lasse V. Karlsen
+3  A: 

Hashing is very good. But the other, slightly lower-tech alternative is to run a diff tool like WinMerge or TextWrangler and compare the two versions of each file. It's boring, and there's room for human error.

Best of all, use version control to ensure the files you're testing are the files you edited and the ones you're going to launch. We have checkout folders from our repo as the staging and live sites, so once you've committed the changes from your working copy, you can be 100% sure that the files you test, push to staging and then live are the same, because you just run "svn update" on each box and check the revision number.

Oh, and if you need to roll back in a hurry (it happens to us all at one time or another), you just run svn update again with the -r switch and go back to a previous revision virtually instantly.

Flubba