views: 217
answers: 4
I've got a few hundred computers running an app. On one computer, I've seen two instances of a single bit being incorrectly set in strings that I pull out of SQLite. If this were my dev computer I would assume I have a bug somewhere, but there is certainly some number of installations at which I'll start seeing rare hardware-based errors.

This certainly depends on how much I/O I do, but are there any rules of thumb for when there is a decent chance of seeing this kind of thing? For example, for TCP packets, this paper determined that silent, undetected corruption occurs in "roughly 1 in 16 million to 10 billion packets".

Unfortunately, running a mem/disk checker on the machine in question is not likely to happen.

+3  A: 

When I notice strange things happening, my strategy is:

  1. check if there is a bug in the code
  2. check if there is a bug in the library/tool you're using (SQLite, here)
  3. check if there is a bug in the compiler
  4. then, and only then, check for hardware faults

In my 10-year career, 99.99% of bugs have been software-related.

Hope this helps.

friol
You should also check the driver or I/O controller if you're working on a customised OS.
Quibblesome
A: 

With subtle errors, it can happen at any time, and from several sources, even the most unlikely.

As you can see errors occurring on a single machine, your best option is to handle the damage rather than rely on statistics to tell you when something might go wrong. Whilst the errors might be due to external factors, if you've seen more than one it would be prudent to get that memchecker running on the machine to check that it's not faulty hardware. The alternative is trusting to luck that you won't see a total failure.

gbjbaanb
+1  A: 

Bit errors will happen. Consider protecting your data with CRCs or some other kind of error detection/correction mechanism. The odds of it happening depend on what kind of hardware you have. If you have ECC memory, for instance, errors will be less likely than if you don't, but even ECC memory goes bad and may fail to correct errors. With several hundred computers, I would say the odd hardware error is very likely, probably certain, to happen daily.
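For instance, here's a minimal sketch of that idea in Python: checksum each string on write and re-verify on read. The kv table and column names are hypothetical, not something from the question:

    import sqlite3
    import zlib

    def store(conn, key, value):
        # Compute a CRC-32 over the UTF-8 bytes and store it alongside the value.
        crc = zlib.crc32(value.encode("utf-8"))
        conn.execute(
            "INSERT OR REPLACE INTO kv (key, value, crc) VALUES (?, ?, ?)",
            (key, value, crc),
        )

    def load(conn, key):
        # Recompute the CRC on read; a mismatch flags a flipped bit
        # somewhere between write and read.
        row = conn.execute(
            "SELECT value, crc FROM kv WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            return None
        value, stored_crc = row
        if zlib.crc32(value.encode("utf-8")) != stored_crc:
            raise ValueError(f"CRC mismatch for {key!r}: possible corruption")
        return value

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT, crc INTEGER)")
    store(conn, "greeting", "hello world")
    print(load(conn, "greeting"))  # raises ValueError if the stored bytes changed

Note that a plain CRC only detects corruption; if you need to repair it as well, you'd pair this with redundancy such as a second copy or a proper error-correcting code.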

Steve Baker
A: 

"Wikipedia: ECC memory" says "Recent DRAM tests give widely varying error rates with over 7 orders of magnitude difference, ranging from 10^−10 to 10^−17 error/bit·h, roughly one bit error, per hour, per gigabyte of memory to one bit error, per century, per gigabyte of memory.[7][11][12]"

Even if we use the most optimistic estimate of one bit error per century per gigabyte, a cluster of 100 computers with 2 GB of RAM each implies that you'll see a bit error about twice a year. (This counts only RAM bit errors. You mentioned undetected TCP packet corruption, and you might also consider disk drive failures, accidental power cord unplugging, cooling fan failures, etc.) The more pessimistic estimates imply you'll see bit errors far more often -- as Steve Baker said, bit errors are inevitable.
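Spelling out that arithmetic (the fleet size and the per-gigabyte rate are the assumptions above, not measured values):

    machines = 100
    gb_each = 2
    errors_per_gb_per_year = 1 / 100  # "one bit error per century per gigabyte"
    print(machines * gb_each * errors_per_gb_per_year)  # 2.0 expected errors/year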

David Cary