Clarified Question:

When the OS sends the command to write a sector to disk, is it atomic? I.e., does the write of the new data succeed fully, or is the old data left intact, should the power fail immediately following the write command? I don't care about what happens in multi-sector writes - torn pages are acceptable.

Old Question:

Say you have old data X on disk, you write new data Y over it, and a tree falls on the power line during that write. With no fancy UPS or battery backed disk controller, you can end up with a torn page, where the data on disk is part X and part Y. Can you ever end up with a situation where the data on disk is part X, part Y, and part garbage?

I've been trying to understand the design of ACID systems like databases, and to my naive thinking, it seems firebird, which does not use a write-ahead log, is relying that a given write will not destroy old data (X) - only fail to fully write new data (Y). That means that if part of X is being overwritten, only the part of X that is being overwritten can be changed, not the part of X we intend to keep.

To clarify, this means if you have a page sized buffer, say 4096 bytes, filled with half Y, half X that we want to keep - and we tell the OS to write that buffer over X, there is no situation short of serious disk failure where the half X that we want to keep is corrupted during the write.
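To make that assumption concrete, here is a toy Python simulation (the sector size and all names are mine, purely illustrative) of a page write torn by a power failure, under the assumption that each individual sector write is atomic:

```python
SECTOR = 512
PAGE = 4096
SECTORS_PER_PAGE = PAGE // SECTOR

def write_page_with_crash(old_page: bytes, new_page: bytes, crash_after: int) -> bytes:
    """Simulate power failing after `crash_after` whole sectors were written.
    Each sector is assumed atomic: it ends up holding either the complete
    old contents or the complete new contents, never a mix of the two."""
    sectors = []
    for i in range(SECTORS_PER_PAGE):
        lo, hi = i * SECTOR, (i + 1) * SECTOR
        sectors.append(new_page[lo:hi] if i < crash_after else old_page[lo:hi])
    return b"".join(sectors)

old = b"X" * PAGE
new = b"Y" * PAGE
torn = write_page_with_crash(old, new, crash_after=3)
# The torn page mixes X and Y, but every sector is wholly one or the other.
assert all(
    torn[i * SECTOR:(i + 1) * SECTOR] in (b"X" * SECTOR, b"Y" * SECTOR)
    for i in range(SECTORS_PER_PAGE)
)
```

Under this model the half of X you wanted to keep survives intact; the question is whether real hardware honors the per-sector atomicity the model assumes.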

+1  A: 

I suspect this assumption is wrong.

Modern HDDs encode the data in sectors and additionally protect it with ECC. Therefore the entire sector content can end up garbled - it will simply no longer make sense with the encoding used.

As for the increasingly popular SSDs, the situation is even more gruesome - the block is cleared prior to being overwritten, so, depending on the firmware being used and the amount of free space, entirely unrelated sectors can be damaged.

By the way, an OS crash will not lead to data being damaged within a single sector.

EFraim
I suspect that the OP is referring more to databases (and their data integrity mechanisms) than the actual disk itself. Databases contain mechanisms such as transactions, serialization and journaling to prevent what you are describing from damaging the data.
Robert Harvey
Good point. Also, I think the partition table will remove a "pointer" to data X before it tries to write data Y. I am not sure, but just throwing that out there.
Jay
@Jay: What you are referring to is called "soft updates" and is actually incredibly difficult to get right (not many file systems do it; IMHO the FreeBSD one does). In fact, even a journaled FS like ext3 can get garbled data into a file in case of a crash.
EFraim
@Robert: IMHO the OP is interested in just HOW the journal ensures data integrity.
EFraim
I think the representation of the data on disk is irrelevant. What matters is the operating system's data integrity guarantees for the system calls you are using. This varies from operating system to operating system, and even between multiple file systems on the same operating system, or even depending on the configuration of a particular file system (e.g. the ext3 data={journal,ordered,writeback} option).
daf
@EFraim, could you elaborate on your last statement please? "By the way, an OS crash will not lead to data being damaged within a single sector." And yes, I am more interested in how software achieves the D in ACID.
Eloff
@Eloff: the fact is that once the OS has given the "write sector" command to the disk, even if it were to crash the next millisecond, the disk controller will put it on the platter nonetheless.
EFraim
@EFraim: that answers my question then. Sector writes are atomic, they either succeed entirely, or don't happen at all. Which makes torn pages the worst you can reasonably expect in an OS crash.
Eloff
@EFraim: So you're saying sector writes are atomic even in the face of an OS crash (which as I understand it includes power failure and system component failures.) If that is indeed correct, please make that clear in your answer and I will accept it as the answer to my question. Your answer confused me because it seems to say both that you can get garbaged sectors, and that you cannot get damage within a single sector (assuming perfectly working hard disk.)
Eloff
+1  A: 

The answer to your first question depends on the hardware involved. At least with some older hardware, the answer was yes -- a power failure could result in garbage being written to the disk. Most current disks, however, have a bit of a "UPS" built into the disk itself -- a capacitor that's large enough to power the disk long enough to write the data in the on-disk cache out to the disk platter. They also have circuitry to detect whether the power supply is still good, so when the power gets flaky, they write the data in the cache to the platter and ignore any garbage they might receive.

As far as a "torn page" goes, a typical disk only accepts commands to write an entire sector at a time, so what you'll get will normally be an integral number of sectors written correctly, and others remaining unchanged. If, however, you're using a logical page size that's larger than a single sector, you can certainly end up with a page that's partially written.

That, however, mostly applies to a direct connection to a normal moving-platter type hard drive. With almost anything else, the rules can and often will be different. Just for an obvious example, if you're writing over the network, you're mostly at the mercy of the network protocol in use. If you transmit data over TCP, data that doesn't match up with the CRC will be rejected, but the same data transmitted over UDP, with the same corruption, might be accepted.

Jerry Coffin
@Jerry: IMHO the question is concerned with the case where the disk got the command to write a single sector but has insufficient power to complete it. I am pretty sure not all modern disks can always finish writing a sector.
EFraim
@EFraim: that was the case I had in mind. If the modern disk cannot finish writing the current sector, it must leave it as a mixture of OLD and NEW data only; if any garbage data makes it into that sector, it would need to be restored from a duplicate copy somewhere else.
Eloff
You can get battery (or capacitor) backed disks or raid controllers that will write out the cache in the event of system failure - which normally should mean that fsync only has to wait for data to hit the write cache (very fast.) Running on hardware like that, torn pages are still possible, but a sector should behave atomically, either written or not. I had in mind cheaper disks than that - but not so cheap that they lie to the OS about fsync, as you cannot safely run an ACID db on that hardware.
Eloff
+8  A: 

I think torn pages are not the problem. As far as I know, all drives have enough power stored to finish writing the current sector when the power fails.

The problem is that everybody lies.

At least when it comes to the database knowing when a transaction has been committed to disk, everybody lies. The database issues an fsync, and the operating system only returns when all outstanding writes have been committed to disk, right? Maybe not. It's common, especially with RAID cards and/or SATA drives, for your program to be told everything has committed (that is, fsync returns) and yet there is data not yet on the drive.

You can try using Brad's diskchecker to find out if the platform you are going to use for your database can survive pulling the plug without losing data. The bottom line: if diskchecker fails, the platform is not safe for running a database. Databases with ACID rely upon knowing when a transaction has been committed to backing store and when it has not. This is true whether or not the database uses write-ahead logging (and if the database returns to the user without having done an fsync, then transactions can be lost in the event of a failure, so it should not claim that it provides ACID semantics).
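For reference, on a POSIX system the durable-write path goes through fsync. A minimal Python sketch (file name is illustrative), with the caveat from above that fsync returning proves nothing if the hardware lies about its write cache:

```python
import os
import tempfile

def durable_write(path: str, data: bytes) -> None:
    """Write data and ask the OS to push it to stable storage.
    Caveat: if the drive or RAID controller lies about its write cache,
    fsync() returning is still no guarantee the bits hit the platter."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)  # block until the OS claims the data is on the device
    finally:
        os.close(fd)
    # fsync the directory too, so the new directory entry is also durable
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)

path = os.path.join(tempfile.mkdtemp(), "commit.log")
durable_write(path, b"COMMIT\n")
```

A database's commit path is essentially this plus its own logging; diskchecker exists precisely to test whether the stack underneath fsync is telling the truth.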

There's a long thread on the Postgresql mailing list discussing durability. It starts out talking about SSDs, but then it gets into SATA drives, SCSI drives, and file systems. You may be surprised to learn how exposed your data can be to loss. It's a good thread for anyone with a database that needs durability, not just those running Postgresql.

Wayne Conrad
You are correct: you have to deploy your database on storage devices that correctly report back to the OS when data is fsynced, otherwise the D in ACID is not possible. There are torn pages to deal with when the page size (write size) is a multiple of the sector size, but as long as drives finish writing the current sector and report fsync correctly to the OS, torn pages are probably the worst situation you can commonly encounter.
Eloff
+2  A: 

People don't seem to agree on what happens during a sector write if the power fails. Maybe because it depends on the hardware being used, and even the filesystem.

From wikipedia (http://en.wikipedia.org/wiki/Journaling_file_system):

Some disk drives guarantee write atomicity during a power failure. Others, however, may stop writing midway through a sector after power is lost, leaving it mismatched against its error-correcting code. The sector is thus corrupt and its contents lost. A physical journal guards against such corruption because it holds a complete copy of the sector, which it can replay over the corruption upon next mount.

This seems to suggest that some hard drives will not finish writing the sector, but that a journaling filesystem can protect you from data loss the same way the xlog protects a database.
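The physical-journal replay the Wikipedia quote describes can be sketched in a few lines of Python (a toy in-memory model of my own, not how ext3 is actually implemented): the full page image is journaled before the in-place write, so a crash that garbles the in-place copy can be repaired on the next mount:

```python
PAGE = 4096

class JournaledStore:
    """Toy model of a physical journal protecting in-place page writes."""

    def __init__(self):
        self.journal = {}  # page number -> complete new page image
        self.pages = {}    # the "main" data area

    def write_page(self, n, data, crash_during_inplace_write=False):
        assert len(data) == PAGE
        self.journal[n] = data      # step 1: journal the full page copy
        if crash_during_inplace_write:
            # crash garbles the main copy; the journal copy survives
            self.pages[n] = b"\x00" * PAGE
            return
        self.pages[n] = data        # step 2: write in place
        del self.journal[n]         # step 3: retire the journal entry

    def recover(self):
        # on next mount, replay surviving journal copies over the main area
        for n, data in self.journal.items():
            self.pages[n] = data
        self.journal.clear()

store = JournaledStore()
store.write_page(0, b"Y" * PAGE, crash_during_inplace_write=True)
store.recover()
assert store.pages[0] == b"Y" * PAGE  # corruption repaired from the journal
```

Note this protects against a corrupted sector only because the journal copy itself was written (and flushed) in full before the in-place write began.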

From the linux kernel mailing list in a discussion on ext3 journaling filesystem:

In any case bad sector checksum is hardware bug. Sector write is supposed to be atomic, it either happens or not.

I'd tend to believe that over the wiki comment. Actually, the very existence of a database (firebird) with no xlog implies that sector write is atomic, that it cannot clobber data you did not mean to change.

There's quite a bit of discussion here about the atomicity of sector writes, and again no agreement. But the people who are disagreeing seem to be talking about multiple-sector writes (which are not atomic on many modern hard drives). Those who are saying sector writes are atomic do seem to know more about what they're talking about.

Eloff
A: 

I would expect one torn page to consist of part X, part Y, and part unreadable sector. If a head is in the middle of writing a sector when the power fails, the drive should park the heads immediately, so that the rest of the drive (aside from that one sector) will remain undamaged.

In some cases I would expect several torn pages consisting of part X and part Y, but only one torn page would include an unreadable sector. The reason for several torn pages is that the drive can buffer lots of writes internally, and the order of writing might interleave various sectors from various pages.

I've read conflicting stories about whether a new write to the unreadable sector will make it readable again. Even if the answer is yes, that will be new data Z, neither X nor Y.

Windows programmer
+4  A: 

No, they are not. Worse yet, disks may lie and say the data is written when it is in fact in the disk cache, under default settings. For performance reasons, this may be desirable (actual durability is up to an order of magnitude slower) but it means if you lose power and the disk cache is not physically written, your data is gone.

Real durability is both hard and slow unfortunately, since you need to make at least one full rotation per write, or 2+ with journalling/undo. This limits you to a couple hundred DB transactions per second, and requires disabling write caching at a fairly low level.

For practical purposes though, the difference is not that big of a deal in most cases.

See:

BobMcGee
A: 

Nobody seems to agree on this question. So I spent a lot of time trying different Google queries until I finally found an answer.

From Dr. Stephen Tweedie, Red Hat employee and Linux kernel filesystem and virtual memory developer, in a talk on ext3 (which he developed); transcript here. If anyone knows, it'd be him.

"It's not sufficient just to write the thing to the journal, because there's got to be some mark in the journal which says: well, (has this journal record actually) does this journal record actually represent a complete consistency to the disk? And the way you do that is by having some atomic operation which marks that transaction as being complete on disk" [23m, 14s]

"Now, disks these days actually make these guarantees. If you start a write operation to a disk, then even if the power fails in the middle of that sector write, the disk has enough power available, and it can actually steal power from the rotational energy of the spindle; it has enough power to complete the write of the sector that's being written right now. In all cases, the disks make that guarantee." [23m, 41s]

Eloff
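The commit-marker idea in the first quote can be sketched as follows (a toy recovery routine of my own, not ext3's actual on-disk format): because the final commit record fits in one atomically-written sector, recovery can trust a transaction exactly when its marker is present:

```python
def recover(journal):
    """journal is a list of (kind, payload) records in write order.
    Only transactions terminated by a 'commit' marker are replayed;
    trailing records with no marker are from an incomplete transaction
    and are dropped."""
    applied, pending = [], []
    for kind, payload in journal:
        if kind == "data":
            pending.append(payload)
        elif kind == "commit":
            applied.extend(pending)  # marker present: transaction is real
            pending = []
    return applied

# crash hit before the second transaction's commit marker reached the disk
journal = [("data", "A"), ("commit", None), ("data", "B")]
assert recover(journal) == ["A"]
```

The atomic sector write is what makes the marker trustworthy: it is either fully there (replay the transaction) or fully absent (ignore it), with no in-between state to misinterpret.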