Let's say you're building a journaling/write-ahead-logging storage system. Can you simply implement this by (for each transaction) appending the data (with write(2)), appending a commit marker, and then fsync-ing?

The scenario to consider is if you do a large set of writes to this log then fsync it, and there's a failure during the fsync. Are the inode direct/indirect block pointers flushed only after all data blocks are flushed, or are there no guarantees that blocks are being flushed in order? If the latter, then during recovery, if you see a commit marker at the end of the file, you can't trust that the data between it and the previous commit marker is meaningful. Thus you have to rely on another mechanism (involving at least another fsync) to determine what extent of the log file is consistent (e.g., writing/fsyncing the data, then writing/fsyncing the commit marker).

If it makes a difference, I'm mainly asking about ext3/ext4.

+1  A: 

There's no guarantee on the order in which blocks are flushed to disk. These days even the drive itself can re-order blocks on their way to the platters.

If you want to enforce ordering, you need to at least fdatasync() between the writes that you want ordered. All a sync promises is that when it returns, everything written before the sync has hit storage.
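A minimal sketch of that ordering, assuming a POSIX system: the payload is synced before the commit marker is written, so a marker found during recovery implies the data before it was already on storage. The `COMMIT_MARKER` format and the helper names are hypothetical, and the guarantee still depends on the sync call really reaching the disk.

```python
import os
import tempfile

COMMIT_MARKER = b"COMMIT\n"  # hypothetical marker format, for illustration only


def sync_data(fd):
    # fdatasync is not available on every platform; fall back to fsync.
    getattr(os, "fdatasync", os.fsync)(fd)


def write_all(fd, buf):
    # os.write may write fewer bytes than requested, so loop until done.
    view = memoryview(buf)
    while view:
        view = view[os.write(fd, view):]


def append_transaction(fd, payload):
    write_all(fd, payload)
    sync_data(fd)  # payload is synced before the marker is even written
    write_all(fd, COMMIT_MARKER)
    sync_data(fd)  # marker is synced; the transaction counts as committed


path = os.path.join(tempfile.mkdtemp(), "wal.log")
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
append_transaction(fd, b"set x=1\n")
append_transaction(fd, b"set y=2\n")
os.close(fd)

with open(path, "rb") as f:
    log_contents = f.read()
```

This costs two syncs per transaction, which is the price of ordering without any checksum in the log.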

caf
+2  A: 

Note that Linux's and Mac OS's fsync and fdatasync are incorrect by default. Windows is correct by default, but can emulate Linux for benchmarking purposes.

Also, fdatasync issues multiple disk writes if you append to the end of a file, since it needs to update the file inode with the new length. If you want to have one write per commit, your best bet is to pre-allocate log space, store a CRC of the log entries in the commit marker, and issue a single fdatasync() at commit. That way, no matter how much the OS / hardware reorder behind your back, you can find a prefix of the log that actually hit disk.
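Here is a rough sketch of that scheme, assuming a POSIX system. The record layout (length, payload, CRC32 trailer acting as the commit marker), the `LOG_SIZE` constant, and all function names are made up for illustration; preallocation uses `os.posix_fallocate` where available and falls back to writing zeros.

```python
import os
import struct
import tempfile
import zlib

LOG_SIZE = 4096  # hypothetical fixed log size


def sync_data(fd):
    # fdatasync is not available on every platform; fall back to fsync.
    getattr(os, "fdatasync", os.fsync)(fd)


def preallocate(fd, size):
    try:
        os.posix_fallocate(fd, 0, size)
    except (AttributeError, OSError):
        os.pwrite(fd, b"\0" * size, 0)  # fallback: fill with zeros
    os.fsync(fd)  # make the file length durable once, up front


def commit(fd, offset, payload):
    # Record layout (an assumption): length | payload | crc32(length + payload)
    header = struct.pack("<I", len(payload))
    record = header + payload + struct.pack("<I", zlib.crc32(header + payload))
    if os.pwrite(fd, record, offset) != len(record):
        raise OSError("short write")
    sync_data(fd)  # the single sync for this commit
    return offset + len(record)


def recover(buf):
    """Return the payloads in the longest valid prefix of the log."""
    payloads, offset = [], 0
    while offset + 8 <= len(buf):
        (length,) = struct.unpack_from("<I", buf, offset)
        end = offset + 4 + length
        if length == 0 or end + 4 > len(buf):
            break  # unused (zeroed) space or a truncated record
        (stored,) = struct.unpack_from("<I", buf, end)
        if stored != zlib.crc32(buf[offset:end]):
            break  # torn record: nothing after this point is trusted
        payloads.append(buf[offset + 4:end])
        offset = end + 4
    return payloads


path = os.path.join(tempfile.mkdtemp(), "wal.log")
fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
preallocate(fd, LOG_SIZE)
off = commit(fd, 0, b"txn-1")
off = commit(fd, off, b"txn-2")
commit(fd, off, b"txn-3")
os.pwrite(fd, b"X", off + 4)  # simulate a data block of txn-3 that never hit disk
os.close(fd)

with open(path, "rb") as f:
    recovered = recover(f.read())
```

Recovery stops at the first record whose CRC does not match, so `recovered` holds only the first two transactions. A real log would also have to deal with stale records left over when the preallocated space is reused, which this sketch ignores.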

If you want to use the log for durable commits or write ahead, things get harder, since you need to make sure that fsync actually works. Under Linux, you'll want to disable the disk write cache with hdparm, or mount the partition with barrier set to true. [Edit: I stand corrected, barrier doesn't seem to give the correct semantics. SATA and SCSI introduce a number of primitives, such as write barriers and native command queuing, that make it possible for operating systems to export primitives that enable write-ahead logging. From what I can tell from manpages and online, Linux only exposes these to filesystem developers, not to userspace.]

Paradoxically, disabling the disk write cache sometimes leads to better performance, since you get more control over write scheduling in user space; if the disk queues up a bunch of synchronous write requests, you end up exposing strange latency spikes to the application. Disabling write cache prevents this from happening.

Finally, real systems use group commit, and perform fewer than one sync write per commit under concurrent workloads.
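The idea behind group commit can be sketched roughly as follows. This is a single-threaded simplification with a hypothetical `GroupCommitLog` class; a real system would have concurrent transactions block until the shared sync completes.

```python
import os
import tempfile


class GroupCommitLog:
    """Buffers commits from several transactions and syncs them together."""

    def __init__(self, fd):
        self.fd = fd
        self.pending = []
        self.sync_count = 0

    def submit(self, payload):
        # Queue the transaction; it is not durable until flush() returns.
        self.pending.append(payload + b"COMMIT\n")

    def flush(self):
        if not self.pending:
            return 0
        batch = memoryview(b"".join(self.pending))
        while batch:
            batch = batch[os.write(self.fd, batch):]
        os.fsync(self.fd)  # one sync covers every queued transaction
        self.sync_count += 1
        committed = len(self.pending)
        self.pending.clear()
        return committed


path = os.path.join(tempfile.mkdtemp(), "wal.log")
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
log = GroupCommitLog(fd)
for i in range(4):
    log.submit(b"txn %d\n" % i)
committed = log.flush()
os.close(fd)
```

Here four transactions share one sync, i.e. 0.25 sync writes per commit.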

Russell Sears
Thanks for the response Russell - would you mind clarifying what you mean by fsync and fdatasync being incorrect? And in the pre-allocation technique, how do you accomplish the pre-allocation?
Yang
Final question on the relationship between `hdparm -W` and `barrier=1` : from reading the docs, my understanding of `hdparm -W` is that it toggles the device's internal cache, whereas `barrier=1` controls whether we flush blocks from the block layer to the device. Does `barrier=1` also somehow guarantee that the flushed blocks also make it past the device's internal cache?
Yang
And it seems that `barrier=1` only affects journal blocks - wouldn't you need to disable write caching anyway for durable fsyncs?
Yang
Answering my own follow-ups: 'correctness' refers to whether write caching/barriers are enabled. Allocate space with `posix_fallocate`. Barriers flush blocks to the disk and past its write cache as well. Barriers only affect journal blocks, so enabling them isn't necessarily sufficient (unless you do a FS operation like a mv); disabling the write cache is still important.
Yang
As I understand it, `data=ordered` + `barrier=1` is only really useful for "safe writes", in which applications write a new copy of the file, fsync it to disk, then rename it over the old one. This ensures that you either get the old version or the new version of the file at reboot; the file doesn't necessarily reach disk before the mv returns. (The ext4 data loss problems people have been reporting are due to broken applications that skip the fsync call.)
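For reference, a "safe write" might look roughly like this on a POSIX system. The `safe_write` name is hypothetical, and the directory fsync at the end is an extra step not described above, added on the assumption that you also want the rename itself to be durable.

```python
import os
import tempfile


def safe_write(path, data):
    directory = os.path.dirname(os.path.abspath(path))
    # Write the new copy to a temporary file in the same directory,
    # so the rename below stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # the step that broken applications skip
        os.replace(tmp_path, path)  # atomic rename over the old version
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Assumption: sync the directory so the rename is durable as well.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    except OSError:
        pass  # some filesystems do not support fsync on a directory
    finally:
        os.close(dir_fd)


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "config.txt")
safe_write(target, b"old version\n")
safe_write(target, b"new version\n")

with open(target, "rb") as f:
    current = f.read()
remaining_files = os.listdir(workdir)
```

After a crash at any point, `config.txt` should contain either the old or the new contents in full, never a mix of the two.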
Russell Sears