
MD5 and SHA-1 hashes have known weaknesses against collision attacks. SHA-256 doesn't, but it outputs 256 bits. Can I safely take the first or last 128 bits and use that as the hash? I know it will be weaker (because it has fewer bits), but otherwise will it work?

Basically I want to use this to uniquely identify files in a file system that might one day contain a trillion files. I'm aware of the birthday problem; with a trillion files, a 128-bit hash should still give roughly a one-in-a-trillion chance (if not better) of two different files ending up with the same hash. I can live with those odds.
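As a sanity check on that claim, here's the usual n²/(2d) birthday approximation worked out (Python purely for convenience):

```python
# Birthday bound approximation: P(collision) ≈ n² / (2d)
# for n items hashed into d possible values (valid while P is small).
n = 10**12          # a trillion files
d = 2**128          # 128-bit hash space
p = n * n / (2 * d)
print(p)            # ≈ 1.5e-15, comfortably below the one-in-a-trillion level
```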

What I can't live with is somebody being able to easily and deliberately insert a new file with the same hash and the same beginning characters as an existing file. I believe this is possible with MD5 and SHA-1.
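To make it concrete, this is the kind of thing I have in mind (a minimal sketch, assuming Python and hashlib; the chunked read is only so huge files don't have to fit in memory):

```python
import hashlib

def file_id(path: str) -> bytes:
    """Return the first 128 bits (16 bytes) of the file's SHA-256 digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return h.digest()[:16]  # truncate the 256-bit digest to 128 bits
```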

A: 

Yes, that will work.

For the record, there are known in-use collision attacks against MD5, but the SHA-1 attacks are at this point completely theoretical (no SHA-1 collision has ever been found... yet).

BlueRaja - Danny Pflughoeft
SHA-256 (the hash the OP is talking about) is SHA-2 though, not SHA-1 - I think? And so far no collisions have been found for SHA-2... not even theoretically.
mrl33t
@blueraja - not totally true. Check out: http://people.csail.mit.edu/yiqun/SHA1AttackProceedingVersion.pdf
Yuval A
@mrl33t: No; SHA-1 has theoretical vulnerabilities, but SHA-256 (which is part of the SHA-2 suite) does not even have those. Considering the size of SHA-256 hashes are 2^128 times LARGER than SHA-1, and SHA-2 is thought to be more theoretically secure, it's not likely there'll be any SHA-256 collisions any time soon.
BlueRaja - Danny Pflughoeft
@Yuval: Yes, that is the theoretical vulnerability I mentioned (actually, there is a more recent paper that reduces the search space even more). Even so, what I said was completely true: there are still no known collisions for SHA-1.
BlueRaja - Danny Pflughoeft
2^128 times larger? WOW! ;) I think you might want to check your math, or your wording...
Dan McGrath
@Dan: whoops, I meant the **search space** is **2^96** times larger, sorry (2^96 * 2^160 = 2^256)
BlueRaja - Danny Pflughoeft
A: 

But is it worth it? If you keep a hash for each file, then you essentially carry an overhead for each file. Let's say each file must take up at least 512 bytes (a typical disk sector) and that you're storing these hashes compactly enough that each hash doesn't take up much more than the hash size itself.

So, even if all your files are at the 512-byte minimum, you're talking either 16 / 512 = 3.1% or 32 / 512 = 6.3% overhead. In reality I'd bet your average file size is higher (unless every file really is one sector...), so the overhead would be less.

Now, the amount of space you need for hashes scales linearly with the number of files you have. Is the extra space worth that much? Even with your mentioned trillion files, that's 1 000 000 000 000 * 16 bytes ≈ 15 TiB, which is a lot of space, but keep in mind: your data would be 1 000 000 000 000 * 512 bytes ≈ 465 TiB. The absolute numbers hardly matter, really, since it's still 3% or 6% overhead. At this level, where you have roughly half a petabyte of storage, do 15 terabytes matter? At any level, does a 3% savings mean anything? And remember, if your files are larger, you save less. (Which they probably are: good luck getting a 512-byte sector size on disks that big.)
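Just to redo that arithmetic explicitly (a throwaway sketch; Python purely for illustration):

```python
files = 10**12                     # the trillion files mentioned in the question
data_bytes = files * 512           # every file at the 512-byte minimum
for hash_bytes in (16, 32):        # truncated vs. full SHA-256
    overhead = files * hash_bytes  # ≈ 15 TiB or ≈ 29 TiB of hashes
    print(f"{hash_bytes}-byte hashes: {overhead / 2**40:.1f} TiB "
          f"({hash_bytes / 512:.1%} of the data)")
print(f"data: {data_bytes / 2**40:.0f} TiB")
```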

So, is a 3%-or-less saving in disk space worth the potential security risk? (Which I'll leave unanswered, as it's waaay not my cup of tea.)

Alternatively, could you, say, group files together in some logical fashion so that you have fewer files? (I mean, if you have trillions of 512-byte files, do you really want to hash every byte on disk?)

Thanatos
+2  A: 

Yeah, that will work. Theoretically it's better to XOR the two halves together, but even truncated SHA-256 is stronger than MD5. You should still consider the result a 128-bit hash rather than a 256-bit hash, though.
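If you did want the XOR-folded variant rather than plain truncation, it's a one-liner (a sketch only, assuming Python; whether folding actually buys anything over truncation is debated, as the comments below note):

```python
import hashlib

def folded_128(data: bytes) -> bytes:
    """XOR the two 16-byte halves of a SHA-256 digest into a 128-bit value."""
    d = hashlib.sha256(data).digest()                    # 32 bytes
    return bytes(a ^ b for a, b in zip(d[:16], d[16:]))  # fold to 16 bytes
```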

My recommendation in this particular case is to store and reference files using HASH + uniquifier, where the uniquifier is the count of how many distinct files you've seen with this hash before. That way you don't fall completely flat if somebody manages to store files built from some future-discovered collision in SHA-256.
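A rough sketch of that scheme (the names and the in-memory dict are mine, purely for illustration; a real system would persist the index and compare the actual files on disk):

```python
import hashlib
from collections import defaultdict

# truncated hash -> contents of the distinct files already seen with that hash
_seen = defaultdict(list)

def file_key(data: bytes) -> tuple[bytes, int]:
    """Return (truncated SHA-256, uniquifier) for a file's contents."""
    short = hashlib.sha256(data).digest()[:16]
    bucket = _seen[short]
    if data not in bucket:            # a genuinely different file with the same hash
        bucket.append(data)
    return short, bucket.index(data)
```

The first file with a given hash gets uniquifier 0; a later, different file that happens to share the hash gets 1, and so on, so references to existing files never change.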

Joshua
I can find no reference that says it is theoretically better to XOR the halves together, and I'm skeptical that it is. Interesting idea with the uniquifier.
GregS
@GregS: some of the early attacks on MD5 resulted in collisions over most of the hash, with only one or two cells differing.
Joshua