What is the most suitable hash function for file integrity checking (checksums) to detect corruption?

I need to consider the following:

Wide range of file size (1 kb to 10GB+)
Lots of different file types
Large collection of files (+/-100 TB and growing)

Do larger files require larger digest sizes (SHA-1 vs SHA-512)?

I see that the SHA-family is referred to as cryptographic hash functions. Are they ill-suited for "general purpose" use such as detecting file corruption? Will something like MD5 or Tiger be better?

If malicious tampering is also a concern, will your answer change w.r.t the most suitable hash function?

External libraries are not an option, only what's available on Win XP SP3+.

Naturally performance is also of concern.

(Please excuse my terminology if it is incorrect, my knowledge on this subject is very limited).

A: 

Any cryptographic hash function, even a broken one, will be fine for detecting accidental corruption. A given hash function may be defined only for inputs up to some limit, but for all standard hash functions that limit is at least 2^64 bits, i.e. about 2 million terabytes. That's quite large.
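As a quick sanity check on that figure, here is the arithmetic in Python (chosen only for illustration; the variable names are mine). It assumes a limit of 2^64 bits and decimal terabytes (10^12 bytes):

```python
# Input length limit assumed here: 2**64 bits (the lower bound quoted above).
limit_bits = 2 ** 64
limit_bytes = limit_bits // 8              # 2**61 bytes
limit_terabytes = limit_bytes / 10 ** 12   # decimal terabytes

print(f"{limit_terabytes:,.0f} TB")        # a little over 2.3 million TB
```

A 100 TB collection, let alone a single 10 GB file, is nowhere near that limit.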

File type has no bearing whatsoever. Hash functions operate over sequences of bits (or bytes) regardless of what those bits represent.

Hash function performance is unlikely to be an issue. Even the "slow" hash functions (e.g. SHA-256) will run faster on a typical PC than the hard disk: reading the file will be the bottleneck, not hashing it (a 2.4 GHz PC can hash data with SHA-512 at a speed close to 200 MB/s, using a single core). If hash function performance is an issue, then either your CPU is very feeble, or your disks are fast SSDs (and if you have 100 TB of fast SSDs then I am kind of jealous). In that case, some hash functions are somewhat faster than others, MD5 being one of the "fast" functions (but MD4 is faster, and it is simple enough that its code can be included in any application without much hassle).
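To illustrate why file size does not matter to the hash function itself, here is a minimal sketch of hashing a file in fixed-size chunks, so that a 10 GB file needs no more memory than a 1 KB one. It uses Python's standard `hashlib` purely for illustration; `file_digest`, its parameters and the 1 MiB chunk size are my own choices, not anything from the question. On a Windows machine without external libraries, the same read-and-update loop would be written against the operating system's built-in cryptographic API.

```python
import hashlib

def file_digest(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of the file at `path`, reading it in chunks.

    `algorithm` is any name accepted by hashlib.new(), e.g. "sha256" or "sha512".
    """
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read until f.read() returns an empty bytes object (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

To detect corruption, store the digest next to the file when it is first written, then recompute and compare later: any mismatch means the contents changed.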

If malicious tampering is a concern, then this becomes a security issue, and that's more complex. First, you will want to use one of the cryptographically unbroken hash functions, hence SHA-256 or SHA-512, not MD4, MD5 or SHA-1 (the weaknesses found in MD4, MD5 and SHA-1 might not apply to a specific situation, but this is a subtle matter and it is better to play safe). Then, hashing may or may not be sufficient, depending on whether the attacker has access to the hash results. Possibly, you may need to use a MAC, which can be viewed as a kind of keyed hash. HMAC is a standard way of building a MAC out of a hash function. There are other, non-hash-based MACs. Moreover, a MAC uses a secret "symmetric" key, which is not appropriate if you want some people to be able to verify the file integrity without being able to perform silent alterations; in that case, you would have to resort to digital signatures. To be brief, in a security context, you need a thorough security analysis with a clearly defined attack model.
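For the MAC case, a minimal sketch of HMAC-SHA-256 over a file might look like the following. Again this is Python's standard library used for illustration only; `file_mac`, `verify_file` and the chunk size are hypothetical names and values of mine, and how the secret key is generated, stored and protected is left out entirely, which is exactly the part that needs the security analysis.

```python
import hashlib
import hmac

def file_mac(path, key, chunk_size=1024 * 1024):
    """Return the HMAC-SHA-256 (hex) of the file at `path` under secret `key` (bytes)."""
    mac = hmac.new(key, digestmod=hashlib.sha256)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            mac.update(chunk)
    return mac.hexdigest()

def verify_file(path, key, expected_hex):
    """Recompute the MAC and compare it to the stored value in constant time."""
    return hmac.compare_digest(file_mac(path, key), expected_hex)
```

Unlike a plain hash, an attacker who can modify both the file and the stored value still cannot produce a matching MAC without the key. Anyone who can verify, however, also holds the key and could forge, which is the limitation that digital signatures address.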

Thomas Pornin
Excellent answer! Thanks so much.