views: 406
answers: 4

Recently, I read an article entitled "SATA vs. SCSI reliability". It mostly discusses the very high rate of bit flipping in consumer SATA drives and concludes: "A 56% chance that you can't read all the data from a particular disk now". Even Raid-5 can't save us, as the array must be constantly scanned for problems, and if a disk does die you are pretty much guaranteed to have some flipped bits on your rebuilt file system.

Considerations:

I've heard great things about Sun's ZFS with Raid-Z, but the Linux and BSD implementations are still experimental. I'm not sure it's ready for prime time yet.

I've also read quite a bit about the Par2 file format. It seems like storing a few percent of extra parity data along with each file would allow you to recover from most problems. However, I am not aware of a file system that does this internally, and it seems like it could be hard to manage the separate files.
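
To make the idea concrete, here is a rough sketch of the sidecar-parity approach. Real Par2 uses Reed-Solomon coding so it can repair damage anywhere in a file set; this simplified version keeps per-block hashes plus a single XOR parity block, which is enough to rebuild one bad block once the hashes have pinpointed it (the block size and function names are just made up for the example):

    import hashlib

    def split_blocks(data, size=4096):
        # Split data into fixed-size blocks, padding the last one with zeros.
        blocks = [data[i:i + size] for i in range(0, len(data), size)]
        return [b.ljust(size, b"\0") for b in blocks]

    def make_recovery_info(blocks):
        # Per-block SHA-256 digests (to locate damage) plus one XOR parity block.
        digests = [hashlib.sha256(b).digest() for b in blocks]
        parity = bytes(len(blocks[0]))
        for b in blocks:
            parity = bytes(x ^ y for x, y in zip(parity, b))
        return digests, parity

    def repair_one(blocks, digests, parity):
        # Rebuild a single damaged block in place; return indices it couldn't fix.
        bad = [i for i, b in enumerate(blocks)
               if hashlib.sha256(b).digest() != digests[i]]
        if len(bad) != 1:
            return bad      # nothing wrong, or more damage than one parity block can fix
        rebuilt = parity
        for i, b in enumerate(blocks):
            if i != bad[0]:
                rebuilt = bytes(x ^ y for x, y in zip(rebuilt, b))
        blocks[bad[0]] = rebuilt
        return []

Par2 proper generalizes this with Reed-Solomon coding, so N recovery blocks can repair any N damaged blocks across a whole set of files, but the bookkeeping is the same idea.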

Backups (Edit):

I understand that backups are paramount. However, without some kind of check in place you could easily be sending bad data to people without even knowing it. Also, figuring out which backup has a good copy of that data could be difficult.

For instance, say you have a Raid-5 array running for a year and you find a corrupted file. Now you have to go back through your backups, checking each one until you find a good copy. Ideally you would go to the first backup that included the file, but that may be difficult to figure out, especially if the file has been edited many times. Even worse, consider what happens if that file was appended to or edited after the corruption occurred. That alone is reason enough for block-level parity such as Par2.
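
On the "which backup has a good copy" problem: one cheap check is to write a hash manifest at backup time and verify candidate backups against it later. A minimal sketch, assuming SHA-256 digests stored as a JSON manifest next to each backup (the layout and function names are made up for the example):

    import hashlib
    import json
    import os

    def sha256_of(path):
        # Stream the file through SHA-256 so large files don't have to fit in RAM.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def write_manifest(root, manifest_path):
        # Record a digest for every file under root at backup time.
        digests = {}
        for dirpath, _, names in os.walk(root):
            for name in names:
                full = os.path.join(dirpath, name)
                digests[os.path.relpath(full, root)] = sha256_of(full)
        with open(manifest_path, "w") as out:
            json.dump(digests, out, indent=2)

    def verify(root, manifest_path):
        # Return the relative paths whose current digest differs from the manifest.
        with open(manifest_path) as f:
            digests = json.load(f)
        return [rel for rel, digest in digests.items()
                if sha256_of(os.path.join(root, rel)) != digest]

Running verify() against each backup generation, newest first, tells you immediately which one still holds a clean copy of the corrupted file.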

A: 

A good backup strategy

David Thibault
+1  A: 

A 56% chance I can't read something? I doubt it. I run a mix of RAID 5 and other goodies along with plain good backup practices, and with RAID 5 and a hot spare I haven't ever had data loss, so I'm not sure what all the fuss is about. If you're storing parity information ... well, you're building a RAID system in software; a disk failure in RAID 5 triggers a parity calculation to get back the lost disk's data, so ... it is already there.

Run RAID, back up your data, and you'll be fine :)
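
To make the "it is already there" point concrete, here is a toy sketch of the XOR parity RAID 5 relies on: the parity chunk for each stripe is the XOR of the data chunks, so a single lost chunk can be recomputed from the survivors (the byte strings below are stand-ins for real stripe data, and real controllers rotate the parity across the disks):

    def xor_blocks(blocks):
        # XOR a list of equal-length byte strings together.
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, b in enumerate(block):
                out[i] ^= b
        return bytes(out)

    # One stripe across a four-disk array: three data chunks plus their parity.
    data = [b"disk0...", b"disk1...", b"disk2..."]
    parity = xor_blocks(data)

    # Disk 1 dies; its chunk is rebuilt from the two survivors plus the parity.
    rebuilt = xor_blocks([data[0], data[2], parity])
    assert rebuilt == data[1]

The catch, as the comments below get into, is that this only works when the drive reports which chunk is missing; the XOR math by itself can't tell a silently flipped bit from good data.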

typemismatch
I'm not so sure. If any of the "added up" bits are flipped and you rebuild you end up with the wrong value.
Rick Minerich
The bits aren't flipped by the time they arrive in userspace. The disk controller notices a failed checksum and returns "read failed" to the RAID controller or the OS in the case of software RAID. Thus the bits from the dodgy sector aren't included in the RAID 5 calculation.
tialaramex
You misunderstand. I'm talking about disk reconstruction.
Rick Minerich
What is there to misunderstand? Flipped bits cause a checksum failure in the disk controller, and it reports "read failed". No corruption.
tialaramex
When you pull out a disk and slap a new one in you are missing the data which was on that disk. How would it be able to do a checksum without all of the data?
Rick Minerich
Although it appears to you that a disk "sector" is just 512 bytes of data, the disk actually stores checksums and other integrity data which allow it to verify whether the data was retrieved correctly. Bit errors cause the checksum to fail, and you get "read failed" not corrupt data.
tialaramex
Ahh, I guess I misunderstood. Still, you would be unable to rebuild the data and would have to go to backup. Ideally the data could be reconstructed on the fly when something like this happened.
Rick Minerich
The "read failed" argument is missing the point. As the article points out, modern disks are so large, and have such a high physical error rate that the probability of a coherent error (an error which does not violate the checksum) becomes non-trivial.
Kennet Belenky
+1  A: 

That article significantly exaggerates the problem by misunderstanding the source. It assumes that data loss events are independent, i.e. that if I take a thousand disks and get five hundred errors, that's likely to be one error each on five hundred of the disks. But actually, as anyone who has had disk trouble knows, it's probably five hundred errors on one disk (still a tiny fraction of the disk's total capacity), while the other nine hundred and ninety-nine were fine. Thus, in practice, it's not that there's a 56% chance that you can't read all of your disk; it's probably more like 1% or less, but most of the people in that 1% will find they've lost dozens or hundreds of sectors even though the disk as a whole hasn't failed.

Sure enough, practical experiments reflect this understanding, not the one offered in the article.

Basically this is an example of "Chinese whispers". The article linked here refers to another article, which in turn refers indirectly to a published paper. The paper says that of course these events are not independent, but that vital fact disappears in the transition to easily digested blog format.
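
To see how much the independence assumption matters, a back-of-the-envelope comparison helps; the per-sector rate and burst size below are made up purely for illustration and are not taken from the paper:

    # Hypothetical numbers, purely for illustration (not from the cited paper).
    p_sector = 1e-9                 # chance that any one sector is unreadable
    sectors = 10**9                 # roughly 500 GB worth of 512-byte sectors

    # If sector errors were independent, most disks would be affected:
    p_disk_independent = 1 - (1 - p_sector) ** sectors
    print(f"independent model: {p_disk_independent:.0%} of disks have an error")

    # If the same expected number of errors arrives in bursts of ~100 on a few
    # sick disks, the fraction of affected disks collapses:
    expected_errors_per_disk = p_sector * sectors
    p_disk_clustered = expected_errors_per_disk / 100
    print(f"clustered model:   {p_disk_clustered:.0%} of disks have an error")

Same expected number of bad sectors either way; the difference is purely in how they are distributed across disks, which is exactly the detail the blog-to-blog retelling dropped.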

tialaramex
I personally have had problems with untouched files becoming corrupted on my desktop's 500 GB HD. These are usually images, of which I have several hundred thousand for testing, and the corruption sometimes causes my tests to fail. Do you have any examples of practical experiments?
Rick Minerich
Sure: the article you were excited about links to another article, which doesn't offer any proper references, but eventually, after some badgering, its author links the paper in the comments. That paper utterly demolishes this "56%" figure and gives ~1% as the true number.
tialaramex
As for your files: check the SMART diagnostics for the drive, and check your RAM. Files have to be in RAM before the CPU can do anything with them, and a surprising number of people don't consider dodgy RAM as a source of data corruption, despite hard disks having ECC while most DIMMs don't.
tialaramex
+1  A: 

ZFS is a start. Many storage vendors also provide 520-byte-sector drives with extra data protection available. However, this only protects your data once it enters the storage fabric. If it was corrupted at the host level, then you are hosed anyway.

On the horizon are some promising standards-based solutions to this very problem: end-to-end data protection.

Consider T10 DIF (Data Integrity Field). This is an emerging standard (it was drafted 5 years ago) and a new technology, but it has the lofty goal of solving the problem of data corruption.
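
Roughly, T10 DIF extends each 512-byte sector with an 8-byte protection field: a 16-bit guard checksum over the data, a 16-bit application tag, and a 32-bit reference tag (typically derived from the LBA), and every hop that understands the protocol can re-verify it. Here is a sketch of the shape of that check; the real guard is a specific CRC-16, and the sketch substitutes zlib's CRC-32, so its field comes out to 10 bytes instead of 8:

    import struct
    import zlib

    def make_dif(data, lba, app_tag=0):
        # Build a protection field for one 512-byte sector. The real T10 field
        # packs a 16-bit guard CRC, a 16-bit app tag and a 32-bit reference tag
        # into 8 bytes; this sketch uses CRC-32 as the guard, so it takes 10.
        guard = zlib.crc32(data)
        ref_tag = lba & 0xFFFFFFFF      # Type-1 style: low 32 bits of the LBA
        return struct.pack(">IHI", guard, app_tag, ref_tag)

    def check_dif(data, lba, dif, app_tag=0):
        # Any hop that carries the field (HBA, fabric, drive) can redo this check.
        return dif == make_dif(data, lba, app_tag)

    sector = bytes(512)
    dif = make_dif(sector, lba=1234)

    assert check_dif(sector, 1234, dif)                    # intact sector passes
    assert not check_dif(b"\x01" + sector[1:], 1234, dif)  # a flipped bit fails

The value of carrying that field end to end is that the HBA, the fabric and the drive can all run the same check, so corruption picked up before the data reaches the storage fabric no longer gets written out silently.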

unwieldy