tags:

views:

95

answers:

3

Let's say we have a billion unique images, one megabyte each. We calculate the SHA-256 hash for the contents of each file. The possibility of collision depends on:

  • the number of files
  • the size of the single file

How far we can go ignoring this possibility, assuming it is zero?

+1  A: 

First of all, it is not zero, but very close to zero.

The key question is what happens if a collision actually occurs? If the answer is "a nuclear power plant will explode" then you likely shouldn't ignore the collision possibility. In most cases the consequences are not that dare and so you can ignore the collision possibility.

Also don't forget that you software (or a tiny part of it) might be deployed and simultaneously used in a gazillion of computers (some tiny embedded microcomputers that are almost everywhere nowadays included). In such case you need to multiply the estimate you've got by the largest possible number of copies.

sharptooth
+2  A: 

The possibility of a collision does not depend on the size of the files, only on their number.

This is an example of the birthday paradox. The Wikipedia page gives an estimate of the likelihood of a collision. If you run the numbers, you'll see that all harddisks ever produced on Earth can't hold enough 1MB files to get a likelihood of a collision of even 0.01% for SHA-256.

Basically, you can simply ignore the possibility.

Michael Borgwardt
I can't agree with the conclusion. Yes, no hardisks can store that number of files, but you IMO misinterpret the situation. It only takes two files to produce a collision. Although the possibility is very low it still can happen.
sharptooth
@sharptooth: no, I'm not misrepresenting the situation. The possibility of you and everyone you know dying from a road accident on the same day is very low, but it still can happen (and it is much higher than that of an SHA-256 collision). Yet you are ignoring that possibility.
Michael Borgwardt
@Michael Borgwardt: You're right, but I don't ignore that possibility - I take certain steps that make the possibility lower. Also in case of a road accident the head count very rarely exceeds say one hundred people, but in case of a nuclear plant disaster head count can easily exceed hundreds thousands of people.
sharptooth
@sharptooth: I was talking about *separate*, simultaneous road accidents of a few hundred specific people. You can't really take any stepts to make that lower. It would be pointless, as it's already bizarrely low. But still so much more likely than an SHA-256 collision that you can't even imagine how much. It's the same argument as Thomas made.
Michael Borgwardt
@Michael Borgwardt: I gave this problem a thought and I still believe you're wrong. You don't take the number of program instances into account. Yes, maybe a chance of the program malfunctioning is very low, but if you have a gazillion copies - say on every plastic bag and on every tiny manufactured thing chances grow significantly. I don't say that every such defect is automatically a showstopper, but the number of deployed copies should be taken into account.
sharptooth
@sharptooth: No, the chances do *not* grow significantly, because the number is still absolutely dwarfed by the size of the SHA-256 hash space. This is the one thing you're not taking into account properly - all the factors have to be weighted by their actual magnitude, not equally. If you generated one billion hashes per second for every single person on Earth, and did that for one thousand years, you'd still have less than 1% chance of a collision.
Michael Borgwardt
@Michael Borgwardt: Yes, I understand that, but those factors *need to be taken into account*. Just saying that probability of one hash collision is very low regardless of the number of program copies and computation rate is not enough.
sharptooth
+4  A: 

The usual answer goes thus: what is the probability that a rogue asteroid crashes on Earth within the next second, obliterating civilization-as-we-know-it, and killing off a few billion people ? It can be argued that any unlucky event with a probability lower than that is not actually very important.

If we have a "perfect" hash function with output size n, and we have p messages to hash (individual message length is not important), then probability of collision is about p2/2n+1 (this is an approximation which is valid for "small" p, i.e. substantially smaller than 2n/2). For instance, with SHA-256 (n=256) and one billion messages (p=109) then the probability is about 4.3*10-60.

A mass-murderer space rock happens about once every 30 million years on average. This leads to a probability of such an event occurring in the next second to about 10-15. That's 45 orders of magnitude more probable than the SHA-256 collision. Briefly stated, if you find SHA-256 collisions scary then your priorities are wrong.

In a security setup, where an attacker gets to choose the messages which will be hashed, then the attacker may use substantially more than a billion messages; however, you will find that the attacker's success probability will still be vanishingly small. That's the whole point of using a hash function with a 256-bit output: so that risks of collision can be neglected.

Of course, all of the above assumes that SHA-256 is a "perfect" hash function, which is far from being proven. Still, SHA-256 seems quite robust.

Thomas Pornin
Sounds like good reasoning. What is a "space rock" - some body that is flying out in space and can hit the Earth?
sharptooth
This is a very good answer, thanks!But, if in case of collision a nuclear power plant will explode, and it depends on you, will you take that risk? If you are completely right, then we can take the risk, because it is 45 orders of magnitude more probable the civilization to be destroyed. Right?
Hristo Hristov
@Hristo Hristov: My guess is that all cases of such reasoning imply that if such disaster happens either noone survives or noone will take time to find who is at fault and punish that person.
sharptooth