My understanding is that a hash code and checksum are similar things - a numeric value, computed for a block of data, that is relatively unique.

That is, the probability of two blocks of data yielding the same numeric hash/checksum value is low enough that it can be ignored for the purposes of the application.

So do we have two words for the same thing, or are there important differences between hash codes and checksums?

+12  A: 

I would say that a checksum is necessarily a hashcode. However, not all hashcodes make good checksums.

A checksum has a special purpose: it verifies or checks the integrity of data (some can go beyond that by allowing for error correction). "Good" checksums are easy to compute and can detect many types of data corruption (e.g. one, two, or three erroneous bits).
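As a rough illustration of error detection, the sketch below uses CRC-32 from Python's standard `zlib` module as the checksum. The `flip_bit` helper and the sample message are made up for this example and are not part of the answer above.

```python
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted (hypothetical helper)."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


original = b"The quick brown fox"
expected = zlib.crc32(original)   # checksum stored or sent alongside the data

damaged = flip_bit(original, 5)   # simulate a single-bit transmission error

# A mismatch between the recomputed and the stored checksum signals corruption.
print(zlib.crc32(damaged) != expected)  # True
```

CRC-32 is only one possible choice here; it is used because it ships with the standard library and catches every single-bit error.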

A hashcode simply describes a mathematical function that maps data to some value. When used as a means of indexing in data structures (e.g. a hash table), a low collision probability is desirable.
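To make the indexing use concrete, here is a minimal sketch of a hash table, assuming a simple polynomial string hash (`poly_hash`) and a toy `TinyHashTable` class. Both names are hypothetical and exist only for this example. It shows that a collision is tolerable in this setting, because the table still compares keys inside a bucket.

```python
def poly_hash(key: str) -> int:
    """Polynomial string hash (h = 31*h + code point), kept to 32 bits."""
    h = 0
    for ch in key:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h


class TinyHashTable:
    """Toy hash table with separate chaining; not production code."""

    def __init__(self, num_buckets: int = 8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key: str) -> list:
        # The hash code only selects the bucket; it does not identify the key.
        return self.buckets[poly_hash(key) % len(self.buckets)]

    def put(self, key: str, value) -> None:
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def get(self, key: str):
        for k, v in self._bucket(key):
            if k == key:          # equality check resolves collisions
                return v
        raise KeyError(key)


table = TinyHashTable()
table.put("Aa", 1)
table.put("BB", 2)

print(poly_hash("Aa") == poly_hash("BB"))  # True: the two keys collide
print(table.get("Aa"), table.get("BB"))    # 1 2
```

Fewer collisions mean shorter buckets and faster lookups, which is why a low collision probability is desirable even though correctness does not depend on it.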

Zach Scrivena
I would rather say: a hashcode is a checksum.
Gumbo
Maybe one could be used as the other, but considering that they have different design goals this just confuses the issue.
Wim Coenen
+7  A: 

Wikipedia puts it well:

Checksum functions are related to hash functions, fingerprints, randomisation functions, and cryptographic hash functions. However, each of those concepts has different applications and therefore different design goals. Check digits and parity bits are special cases of checksums, appropriate for small blocks of data (such as Social Security numbers, bank account numbers, computer words, single bytes, etc.). Some error-correcting codes are based on special checksums that not only detect common errors but also allow the original data to be recovered in certain cases.

Jon Skeet
After reading that, I'm still wondering what the difference is.
kirk.burleson
A: 

These days the terms are interchangeable, but in days of yore a checksum was a very simple technique: you'd add all the data up (usually byte by byte) and tack a byte holding that value onto the end. Then you'd hopefully know if any of the original data had been corrupted. Similar to a check bit, but with bytes.
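The old-style scheme described above might be sketched as follows. The function names are invented for this example, and taking the sum modulo 256 is an assumption made so that the result fits in one byte.

```python
def additive_checksum(data: bytes) -> int:
    """Sum of all bytes, reduced modulo 256 so it fits in a single byte."""
    return sum(data) % 256


def append_checksum(data: bytes) -> bytes:
    """Tack the checksum byte onto the end of the payload."""
    return data + bytes([additive_checksum(data)])


def verify(frame: bytes) -> bool:
    """Recompute the checksum over the payload and compare with the last byte."""
    if not frame:
        return False
    payload, check = frame[:-1], frame[-1]
    return additive_checksum(payload) == check


frame = append_checksum(b"hello")
corrupted = b"hellp" + frame[-1:]   # one byte changed in transit
swapped = b"ehllo" + frame[-1:]     # two bytes swapped in transit

print(verify(frame))      # True
print(verify(corrupted))  # False: the changed byte is detected
print(verify(swapped))    # True: reordering goes unnoticed by a plain sum
```

The last case shows the weakness of a plain sum: it ignores byte order, which is one reason stronger checksums such as CRCs came into use.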

Steven Robbins
A: 

I tend to use the word checksum when referring to the code (numeric or otherwise) created for a file or piece of data that can be used to check that the file or data has not been corrupted. The most common usage I come across is to check that files sent across the network have not been altered (deliberately or otherwise).

Ian1971
+4  A: 

There is a different purpose behind each of them:

  • Hash code - designed to be random across its domain (to minimize collisions in hash tables and such). Cryptographic hash codes are also designed to be computationally infeasible to reverse.
  • Checksum - designed to detect the most common errors in the data and often to be fast to compute (for effective checksumming of fast streams of data).

In practice, the same functions are often good for both purposes. In particular, a cryptographically strong hash code is a good checksum (it is almost impossible for a random error to leave a strong hash value unchanged), if you can afford the computational cost.
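A minimal sketch of using a cryptographic hash as a checksum, assuming SHA-256 from Python's standard `hashlib` module. The helper names and the payload are made up for this example.

```python
import hashlib


def sha256_checksum(data: bytes) -> str:
    """Hex digest of the data, used here purely as an integrity check."""
    return hashlib.sha256(data).hexdigest()


def is_intact(data: bytes, expected_hex: str) -> bool:
    """Compare the recomputed digest with the one published for the data."""
    return sha256_checksum(data) == expected_hex


payload = b"important payload"
published = sha256_checksum(payload)   # value distributed alongside the data

print(is_intact(payload, published))               # True
print(is_intact(b"important paylaod", published))  # False
```

The trade-off mentioned above applies: SHA-256 does far more work per byte than a simple checksum, so it may be too slow for very fast data streams.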

Rafał Dowgird
+4  A: 

There are indeed some differences:

  • Checksums just need to be different when the input is different (as far as possible), but it's almost as important that they're fast to compute.
  • Hash codes (for use in hashtables) have the same requirements, and additionally they should be evenly distributed across the code space, especially for inputs that are similar.
  • Cryptographic hashes have the much more stringent requirement that, given a hash, you cannot construct an input that produces this hash. Computation time comes second.
Michael Borgwardt
+1  A: 

Hashcodes and checksums are both used to create a short numerical value from a data item. The difference is that a checksum value should change even if only a small modification is made to the data item. For a hash value, the requirement is merely that real-world data items should have distinct hash values.

Strings are a clear example. A checksum for a string should include each and every bit, and order matters. A hashcode, on the other hand, can often be implemented as a checksum of a limited-length prefix. That would mean that "aaaaaaaaaaba" hashes the same as "aaaaaaaaaaab", but hash algorithms can deal with such collisions.
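The string example can be sketched like this, assuming a hypothetical `prefix_hash` that looks only at the first `PREFIX_LEN` characters, with CRC-32 from the standard `zlib` module standing in for the checksum. The prefix length of 10 is an arbitrary choice for the example.

```python
import zlib

PREFIX_LEN = 10  # assumed cut-off; real implementations choose their own


def prefix_hash(s: str) -> int:
    """Polynomial hash over a limited-length prefix of the string."""
    h = 0
    for ch in s[:PREFIX_LEN]:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h


a = "aaaaaaaaaaba"
b = "aaaaaaaaaaab"

# The strings differ only after the prefix, so the hash codes collide.
print(prefix_hash(a) == prefix_hash(b))  # True

# The checksum covers every byte in order, so it tells the two strings apart.
print(zlib.crc32(a.encode()) != zlib.crc32(b.encode()))  # True
```

A hash table would still keep the two strings apart by comparing the keys themselves after the hash codes match, which is how such collisions are dealt with.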

MSalters