Once again I was in a design review, and encountered the claim that the probability of a particular scenario was "less than the risk of cosmic rays" affecting the program, and it occurred to me that I didn't have the faintest idea what that probability is.

"Since 1/2^128 is 1 out of 340282366920938463463374607431768211456, I think we're justified in taking our chances here, even if these computations are off by a factor of a few billion... We're way more at risk for cosmic rays to screw us up, I believe."

Is this programmer correct? What is the probability of a cosmic ray hitting a computer and affecting the execution of the program?

Update: it seems non-error-corrected memory is quite likely to be hit if you have a reasonably large number of servers. How well does error-corrected memory reduce the effect of this hit rate?

Note regarding closing: this is a real question, affecting a real software development project. The claim has been made, "probability is less than x." What is x?

+33  A: 
KennyTM
Improved error checking? Back when that study was published, most personal computers had a parity bit on each byte of memory. Now error control circuitry on memory systems is generally found only on server-level machines (as far as I know), and not even on all server machines. However, when there is error circuitry on memory systems today, it's generally ECC instead of just parity.
Michael Burr
More importantly, the chip feature size for CPUs in 1995 was around 0.35 µm or 350nm. It's now 1/10th that size at 35nm.
Joe Koberg
Is it possible that instead of reducing risk, decreased size would increase risk since it would take less energy to change the state of each bit?
Robert
@Robert: Why would it take less energy? Anyway, the energy of a cosmic ray is so high that I don't think this factor is important.
KennyTM
Reduced size definitely increases risk. Hardened processors for space vehicles use very large feature sizes to avoid cosmic ray effects.
Joe Koberg
Not just cosmic rays, radioactive isotopes in the materials used in the chip are a much bigger problem. Makers go to huge lengths to make sure the silicon, solder, encapsulation etc doesn't contain any alpha or beta emitters.
Martin Beckett
And then there's the fact that the chip sizes actually _grow_, despite the fact that feature sizes shrink. I suppose that with bigger chips and cosmic rays it's the same as with bigger sails and wind?
sbi
It's sad to see more error tolerant processors (such as SPARC, et al.) go by the wayside. They have all kinds of nifty self-correcting mechanisms built in for such things. Oh well, it seems like the x86 architecture is finally noticing this issue and is starting to design for it too.
Brian Knoblauch
Wow! This means that about 1 byte in my PC gets corrupted every two days.
Stefan Monov
+8  A: 

Wikipedia cites a study by IBM in the 90s suggesting that "computers typically experience about one cosmic-ray-induced error per 256 megabytes of RAM per month." Unfortunately the citation was to an article in Scientific American, which didn't give any further references. Personally, I find that number to be very high, but perhaps most memory errors induced by cosmic rays don't cause any actual or noticeable problems.
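
For a rough sense of scale, here is a back-of-envelope sketch (the rate is the Wikipedia/IBM figure above; the 4 GB machine and the one-month window are assumptions for illustration, not numbers from the study):

    # Assumed: 1 bit error per 256 MB per month (the Wikipedia/IBM figure),
    # on a hypothetical machine with 4 GB of RAM, observed for one month.
    errors_per_mb_per_month = 1.0 / 256
    ram_mb = 4 * 1024

    errors_per_month = ram_mb * errors_per_mb_per_month
    print(errors_per_month)               # 16.0, i.e. roughly one flip every 2 days

    # Compare with the 1/2**128 scenario from the question.
    p_scenario = 1.0 / 2 ** 128           # ~2.9e-39
    print(errors_per_month / p_scenario)  # ~5.4e39: about 40 orders of magnitude apart

Comparing a monthly error rate with a per-scenario probability is apples to oranges, but it shows why "less than the risk of cosmic rays" is a very weak bound for anything near 1/2^128.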

On the other hand, people talking about probabilities when it comes to software scenarios typically have no clue what they are talking about.

JesperE
I guess they should be clearer about "one cosmic-ray-induced error"... If I had to guess, I would say one flipped bit in an array of 256 MB of RAM per month.
wtaniguchi
The probability of a bit being flipped must be multiplied by the probability of the bit having a noticeable effect on the program. I'm guessing the second probability is a lot lower than you think.
Mark Ransom
@Mark: Typical computer programs don't have that kind of fault-tolerance built-in. A single-bit error in the program code will more likely than not crash the program, if the broken code is executed.
Robert Harvey
Yes, but most of the memory contains data, where the flip won't be that visiblp.
zoul
@Robert Harvey, not only will most of the program be data, but much of the actual program will be executed rarely if ever. Think about how tough it is to get 100% code coverage for testing. Also some instruction changes might be very subtle. Combine all those and the probabilities start getting very low.
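To illustrate the multiplication with purely made-up fractions (none of these are measured values):

    # Hypothetical, illustrative fractions only.
    p_bit_in_code        = 0.05   # most resident memory is data, not instructions
    p_code_executed      = 0.30   # much of the program is rarely or never run
    p_change_is_visible  = 0.50   # many single-instruction changes are subtle or benign
    print(p_bit_in_code * p_code_executed * p_change_is_visible)   # 0.0075

So even given a flipped bit, only a small fraction of flips would produce a noticeable failure.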
Mark Ransom
@zoul. lol at 'visiblp', but if e=1100101 and p=1110000 then you're the unfortunate victim of *3* bit flips!
PaulG
@Paul: or *one* finger blip.
Mark
+18  A: 

Apparently, not insignificant. From this New Scientist article, a quote from an Intel patent application:

"Cosmic ray induced computer crashes have occurred and are expected to increase with frequency as devices (for example, transistors) decrease in size in chips. This problem is projected to become a major limiter of computer reliability in the next decade. "

You can read the full patent here.

ire_and_curses
+2  A: 

More often than cosmic rays, ordinary electrical noise corrupts data. Checksums are used to combat this at many levels; on a data cable there is typically a parity bit that travels alongside the data, which greatly reduces the probability of undetected corruption. Then at the parsing level, nonsense data is typically ignored, so even if some corruption did get past the parity bit or other checksums, it would in most cases be discarded.
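
A minimal sketch of that parity idea (illustrative only; serial links compute this in hardware, not in Python):

    def parity_bit(byte):
        """Even parity: 1 if the byte has an odd number of 1 bits."""
        return bin(byte).count("1") % 2

    def check(byte, stored_parity):
        return parity_bit(byte) == stored_parity

    data = 0b01100101
    p = parity_bit(data)
    corrupted = data ^ 0b00010000                 # simulate a single flipped bit
    print(check(data, p), check(corrupted, p))    # True False

A single parity bit detects any odd number of flipped bits but cannot locate or fix them, and it misses an even number of flips; that is why stronger checksums are used at the other levels mentioned.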

Also, some components are electrically shielded to block out noise (probably not cosmic rays I guess).

But in the end, as the other answerers have said, there is the occasional bit or byte that gets scrambled, and it's left up to chance whether that's a significant byte or not. In the best case, a cosmic ray scrambles an unused bit and has absolutely no effect, or it crashes the computer outright (which is arguably a good thing, because the computer is kept from doing harm); in the worst case, well, I'm sure you can imagine.

Ricket
+6  A: 

Well, cosmic rays apparently caused the electronics in Toyota cars to malfunction, so I would say that the probability is very high :)

Are cosmic rays really causing Toyota woes?

Kevin Crowell
"Federal regulators are studying whether sudden acceleration in Toyotas is linked to cosmic rays." This is why you should never give federal regulators power over your life.
Will
I guess the theory here is that cosmic rays are flipping bits in older brains causing them to malfunction and press the wrong pedal.
Knox
"Apparently"? I'd say that's a wild guess at this point. My own wild guess is that this phenomenon is a result of that old nightmare of embedded systems (actually most complex computer systems) - the race condition.
Michael Burr
@Knox: Get out your old tinfoil hat, it *is* useful!
Roger Pate
@Kevin: Comments are appropriate for jokes, not answers. This does not even attempt to answer the question.
Roger Pate
@Roger Providing possible evidence, no matter how far-fetched it may be, does not help answer the question?
Kevin Crowell
It may not be a joke. I've seen some seriously weird stuff like that happen before. Not as rare as most people think.
Brian Knoblauch
@Roger: There's quite a tradition of humorous answers being well taken and up-voted on SO. (Heck, there's even been a tradition of humorous _questions_. Sadly, this has been stopped by the closing police.)
sbi
@Brian: The OP's now-deleted comment (along the lines of "it is a very relevant joke!") indicates the spirit in which it was intended. @sbi: There's a stronger tradition and convention for jokes in comments, and I find it mildly offensive to post such noise answers on questions seriously asked in good faith. I'll willingly downvote any such "not useful" answers, but this one didn't even try to answer the question.
Roger Pate
+4  A: 

I once programmed devices which were meant to fly in space, and there (supposedly; no one ever showed me any paper about it, but it was said to be common knowledge in the business) you could expect cosmic rays to induce errors all the time.

erikkallen
Above the atmosphere two things happen: 1) the total flux is higher, and 2) much more of it comes in the form of heavy, very energetic particles (with enough energy to flip a bit packed into a small space).
dmckee
+5  A: 

If a program is life-critical (it will kill someone if it fails), it needs to be written in such a way that it will either fail-safe, or recover automatically from such a failure. All other programs, YMMV.
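
As a purely illustrative sketch of the fail-safe idea (hypothetical sensor names and thresholds; real life-critical code is built to safety standards, not to a few lines of Python):

    SAFE_THROTTLE = 0.0      # assumed safe fallback: close the throttle
    MAX_STEP = 0.2           # assumed plausible change between two samples

    def next_throttle(previous, reading_a, reading_b):
        """Use two redundant sensor readings; fall back to the safe value when
        they disagree or the value jumps implausibly (discontinuous data)."""
        if abs(reading_a - reading_b) > 0.05:    # redundant sensors disagree
            return SAFE_THROTTLE
        value = (reading_a + reading_b) / 2.0
        if abs(value - previous) > MAX_STEP:     # implausible jump: distrust it
            return SAFE_THROTTLE
        return value

The point is not the specific thresholds but the shape: cross-check redundant inputs, reject implausible data, and default to a state that cannot do harm.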

Toyotas are a case in point. Say what you will about a throttle cable, but it is not software.

See also http://en.wikipedia.org/wiki/Therac-25

Robert Harvey
Never mind the software for throttles. The sensors and wiring for the throttles are the weak point. My Mitsubishi throttle position sensor failed into a random number generator... No unintended acceleration, but it sure didn't do anything good for the fuel mixture!
Brian Knoblauch
@Brian: Good software would have figured out that the data points were discontinuous, and concluded that the data was bad.
Robert Harvey
@Robert ...and then what... Good data is required. Knowing it's bad doesn't help any. Not something you can magically work around.
Brian Knoblauch
@Brian: Well, for one thing, you can take corrective action based on the knowledge that your data is bad.
Robert Harvey
+4  A: 

Memory errors are real, and ECC memory does help. Correctly implemented ECC memory will correct single-bit errors and detect double-bit errors (halting the system if such an error is detected). You can see this from how regularly people complain about what seems to be a software problem that is resolved by running Memtest and discovering bad memory. Of course a transient failure caused by a cosmic ray is different to a consistently failing piece of memory, but it is relevant to the broader question of how much you should trust your memory to operate correctly.
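
To make the "correct single, detect double" behaviour concrete, here is a toy SECDED (extended Hamming) sketch over 8 data bits; it is illustrative only, as real ECC DIMMs use a (72,64) code implemented in the memory controller:

    def hamming_encode(data_bits):
        """data_bits: list of 8 ints (0/1). Returns a 13-bit codeword: positions
        1..12 hold the Hamming code, position 0 holds an overall parity bit."""
        n = 12                      # 8 data bits + 4 Hamming parity bits
        code = [0] * (n + 1)
        # Data bits go in the non-power-of-two positions 3,5,6,7,9,10,11,12.
        data_pos = [i for i in range(1, n + 1) if i & (i - 1) != 0]
        for pos, bit in zip(data_pos, data_bits):
            code[pos] = bit
        # Each Hamming parity bit (positions 1,2,4,8) covers the positions
        # whose index has that bit set.
        for p in (1, 2, 4, 8):
            code[p] = 0
            for i in range(1, n + 1):
                if i != p and (i & p):
                    code[p] ^= code[i]
        # Overall parity over the whole word gives double-error detection.
        code[0] = sum(code[1:]) % 2
        return code

    def hamming_decode(code):
        """Returns (data_bits, status): 'ok', 'corrected', or 'uncorrectable'."""
        n = len(code) - 1
        syndrome = 0
        for i in range(1, n + 1):
            if code[i]:
                syndrome ^= i       # XOR of set positions = error position
        overall = sum(code) % 2     # includes the stored overall-parity bit
        code = code[:]              # don't mutate the caller's copy
        if syndrome and overall:            # single-bit error: correct it
            code[syndrome] ^= 1
            status = 'corrected'
        elif syndrome and not overall:      # double-bit error: detect only
            status = 'uncorrectable'
        elif not syndrome and overall:      # the overall parity bit itself flipped
            code[0] ^= 1
            status = 'corrected'
        else:
            status = 'ok'
        data_pos = [i for i in range(1, n + 1) if i & (i - 1) != 0]
        return [code[i] for i in data_pos], status

    word = [1, 0, 1, 1, 0, 0, 1, 0]
    stored = hamming_encode(word)
    stored[6] ^= 1                       # simulate a cosmic-ray bit flip
    print(hamming_decode(stored))        # (original word, 'corrected')
    stored[9] ^= 1                       # a second flip in the same word
    print(hamming_decode(stored))        # (..., 'uncorrectable')

The real (72,64) codes work the same way at word granularity: a single flipped bit is silently corrected, while two flips in the same word are detected and reported or halt the system, as described above.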

An analysis based on a 20MB resident size might be appropriate for trivial applications, but large systems routinely have multiple servers with large main memories.

Interesting link: http://cr.yp.to/hardware/ecc.html

The Corsair link in the page unfortunately seems to be dead.

janm