views: 2610
answers: 9

Which checksum algorithm can you recommend in the following use case?

I want to generate checksums of small JPEG files (~8 kB each) to check if the content changed. Using the filesystem's date modified is unfortunately not an option.
The checksum need not be cryptographically strong but it should robustly indicate changes of any size.

The second criterion is speed since it should be possible to process at least hundreds of images per second (on a modern CPU).

The calculation will be done on a server with several clients. The clients send the images over Gigabit TCP to the server, so there is no disk I/O to act as a bottleneck.
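For concreteness, a minimal Python sketch of the kind of check being asked about, assuming the image bytes have already arrived in memory (CRC-32 is just one candidate; the thread discusses alternatives below):

```python
import zlib

def checksum(data: bytes) -> int:
    """32-bit CRC over the raw image bytes (fast, not cryptographic)."""
    return zlib.crc32(data)

# Detect a content change by comparing against the previously stored value.
old = checksum(b"original image bytes")
new = checksum(b"modified image bytes")
changed = old != new
```

Comparing stored checksums this way avoids keeping the previous file contents around; only 4 bytes per image need to be retained.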

+7  A: 

If you have many small files, your bottleneck is going to be file I/O and probably not a checksum algorithm.

A list of hash functions (which can be thought of as a checksum) can be found here.

Is there any reason you can't use the filesystem's date modified to determine if a file has changed? That would probably be faster.

luke
Agreed. Most non-cryptographically-secure hash algorithms are short enough that there's an appreciable chance of false negatives, so I would simply use MD5 or SHA1 and be done with it. Computation speed is unlikely to be an issue.
Nick Johnson
A: 

CRC

radu_c
CRC is not cryptographically strong
sergdev
Which makes it suitable for this application.
erickson
the OP didn't ask for a cryptographically strong hash algorithm :)
radu_c
+3  A: 
  • CRC-32 comes to mind, mainly because it's cheap to calculate

  • Any kind of I/O also comes to mind, mainly because it will be the limiting factor for such an undertaking ;)

  • The problem is not calculating the checksums, the problem is to get the images into memory to calculate the checksum.

  • I would suggest "staged" monitoring:

    • stage 1: check for changes of file timestamps and if you detect a change there hand over to...

      (not needed in your case as described in the edited version)

    • stage 2: get the image into memory and calculate the checksum

  • Just as important: multi-threading, i.e. setting up a pipeline that processes several images in parallel if several CPU cores are available.
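A rough sketch of such a parallel pipeline in Python (the image payloads here are stand-ins, and the worker count is an arbitrary choice):

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def checksum(data: bytes) -> int:
    return zlib.crc32(data)

# Stand-ins for JPEG payloads already received over the network.
images = [bytes([i]) * 1024 for i in range(8)]

# Hash several images concurrently. In CPython, zlib.crc32 releases the
# GIL for sufficiently large buffers, so threads can overlap on cores.
with ThreadPoolExecutor(max_workers=4) as pool:
    sums = list(pool.map(checksum, images))
```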

pointernil
A: 

There are lots of fast CRC algorithms that should do the trick: http://www.google.com/search?hl=en&q=fast+crc&aq=f&oq=

Edit: Why the hate? CRC is totally appropriate, as evidenced by the other answers. A Google search was also appropriate, since no language was specified. This is an old, old problem which has been solved so many times that there isn't likely to be a definitive answer.

Mark Ransom
+1  A: 

adler32, available in the zlib headers, is advertised as being significantly faster than crc32, while being only slightly less accurate.
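Both are exposed in Python's zlib module with the same interface, so swapping one for the other is a one-line change (the sample bytes below are just an illustration):

```python
import zlib

data = b"\xff\xd8\xff\xe0" + b"\x00" * 100  # fake JPEG-ish payload

a = zlib.adler32(data)  # faster, but weaker on very short inputs
c = zlib.crc32(data)    # slower, better error-detection properties
```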

Josh Matthews
The Adler checksum is very weak for small amounts of data, less than a few hundred bytes. The SCTP protocol originally used Adler-32 as its checksum but had to switch to CRC-32 for this reason (in RFC 3309). Though the questioner's average size is 8 kB, for the smallest files it could be a problem.
DGentry
That's good to know, thanks!
Josh Matthews
+2  A: 

CRC32 is probably good enough, although there's a small chance of a collision, such that a file that has been modified might look unchanged because the two versions generate the same checksum. To avoid this possibility I'd therefore suggest using MD5, which will easily be fast enough, and which reduces the chance of a collision to the point where it's almost infinitesimal.

As others have said, with lots of small files your real performance bottleneck is going to be I/O so the issue is dealing with that. If you post up a few more details somebody will probably suggest a way of sorting that out as well.
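An MD5 version of the check is just as short via the standard hashlib module; a sketch:

```python
import hashlib

def md5sum(data: bytes) -> str:
    """128-bit MD5 digest as hex; accidental collisions are negligible."""
    return hashlib.md5(data).hexdigest()
```

Storing the 32-character hex digest (or the 16 raw bytes) per image is the only extra cost over CRC-32's 4 bytes.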

Bart Read
+1  A: 

Your most important requirement is "to check if the content changed".

If it is most important that ANY change in the file be detected, MD5, SHA-1 or even SHA-256 should be your choice.

Given that you indicated the checksum need NOT be cryptographically strong, I would recommend CRC-32 for three reasons. CRC-32 gives good Hamming distances over an 8 kB file. CRC-32 will be at least an order of magnitude faster than MD5 to calculate (your second requirement). Sometimes as important, CRC-32 only requires 32 bits to store the value to be compared; MD5 requires 4 times the storage and SHA-1 requires 5 times the storage.

BTW, any technique will be strengthened by prepending the length of the file when calculating the hash.
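The length-prepending trick can be sketched like this in Python; the 8-byte little-endian length field is an arbitrary choice for illustration, and zlib.crc32's second argument lets the CRC continue over the data without concatenating buffers:

```python
import struct
import zlib

def length_prefixed_crc(data: bytes) -> int:
    # Feed an 8-byte little-endian length field into the CRC first,
    # then continue the same CRC over the data itself.
    crc = zlib.crc32(struct.pack("<Q", len(data)))
    return zlib.crc32(data, crc)
```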

Epsilon
+2  A: 

According to the Wiki page pointed to by Luke, MD5 is actually faster than CRC32!

I have tried this myself by using Python 2.6 on Windows Vista, and got the same result.

Here are some results:

crc32: 162.481544276 MBps    md5: 224.489791549 MBps
crc32: 168.332996575 MBps    md5: 226.089336532 MBps
crc32: 155.851515828 MBps    md5: 194.943289532 MBps

I am thinking about the same question as well, and I'm tempted to use rsync's variation of Adler-32 for detecting file differences.
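A rough way to reproduce such a comparison yourself (absolute numbers will vary with the machine, the Python build, and the quality of the underlying C implementations, which is exactly the caveat raised in the comment below):

```python
import hashlib
import time
import zlib

data = b"\x00" * (8 * 1024)  # 8 kB dummy payload, roughly one image
N = 10_000

start = time.perf_counter()
for _ in range(N):
    zlib.crc32(data)
crc_secs = time.perf_counter() - start

start = time.perf_counter()
for _ in range(N):
    hashlib.md5(data).digest()
md5_secs = time.perf_counter() - start

crc_mbps = N * len(data) / crc_secs / 1e6
md5_mbps = N * len(data) / md5_secs / 1e6
```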

That's unusual. Is the quality of the implementation of CRC32 and MD5 equivalent, or does one use every compiler optimization and lookup table trick there is, while the other one doesn't?
Ants
+1  A: 

Just a postscript to the above: JPEGs use lossy compression, and the extent of the compression may depend upon the program used to create the JPEG, the colour palette and/or bit depth on the system, display gamma, graphics card, and user-set compression levels/colour settings. Therefore, comparing JPEGs built on different computers/platforms or using different software will be very difficult at the byte level.

Thanks for this comment. I already took this fact into account.
bene