When a file is closed using close() or fclose(), for example, does Linux guarantee that the file is written back to (persistent) disk?

What I mean is: if close() returns 0 and the power fails immediately afterwards, are the previously written data guaranteed to persist, i.e. to be durable?

The fsync() system call does provide this guarantee. Is closing a file also sufficient?

I can't find anything that makes a claim one way or the other at the moment.
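
To make it concrete, the sequence I have in mind is something like this sketch (foo.dat is just a made-up name; error handling omitted):

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("foo.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        write(fd, "hello", 5);   /* data lands in the OS page cache */
        close(fd);               /* returns 0 ...                   */
        /* ... power fails here: is "hello" on the platter? */
        return 0;
    }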


Question 2:

If close() does implicitly do an fsync(), is there a way of telling it not to?

+2  A: 

No, it's not guaranteed. The OS has its own caching. All that close() really guarantees is that the program's buffers have been handed over to the OS (fclose() flushes the stdio buffers first), but the OS may still be holding the data unwritten. I believe there is some controversy in the Linux kernel world because even fsync() doesn't guarantee that data is flushed to disk, at least on ext3.
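
If you actually want the data on disk, the usual sequence with stdio looks something like this sketch (data.txt is a made-up name; error handling omitted):

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        FILE *fp = fopen("data.txt", "w");
        fputs("important bytes\n", fp);
        fflush(fp);            /* push stdio's user-space buffer to the kernel */
        fsync(fileno(fp));     /* ask the kernel to push its cache to the disk */
        fclose(fp);            /* on its own, this only does the fflush step   */
        return 0;
    }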

Paul Tomblin
Please provide a citation; it would be a hideous bug if fsync() did not do what it's supposed to do.
MarkR
i.e. it would wholly destroy databases' durability.
MarkR
The controversy I've seen is something else, which won't fit in a comment. I'm writing an answer for it.
David Thornley
+7  A: 

No, close(2) does not perform an fsync(2), and it would batter many machines to death if it did. Many intermediate files are opened and closed by their creator, then opened and closed by their consumer, then deleted, and this very common sequence would require touching the disk if close(2) performed an automatic fsync(2). Instead, the disk is usually never touched, and never even knows the file existed.
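
To illustrate, here is a sketch of that everyday sequence (scratch.tmp is a made-up name; error handling omitted); if close(2) implied fsync(2), every step of it would hit the disk:

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[64];

        /* producer writes an intermediate file */
        int fd = open("scratch.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0600);
        write(fd, "intermediate data", 17);
        close(fd);                 /* no disk I/O required so far */

        /* consumer reads it back */
        fd = open("scratch.tmp", O_RDONLY);
        read(fd, buf, sizeof buf);
        close(fd);

        unlink("scratch.tmp");     /* file may vanish before ever
                                      reaching the platter         */
        return 0;
    }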

Brandon Craig Rhodes
Are you sure? Do you have a reference for this?
MarkR
Yes; the man page for close(2), among other things, mentions that no sync is performed. For the larger issue of filesystem caching, see any good text that covers disk caching in a modern OS, such as Hennessy and Patterson's "Computer Architecture: A Quantitative Approach".
Brandon Craig Rhodes
A: 

I don't think Linux can guarantee this since the drive itself can cache data too.

Adrian Grigore
But if the drive has either been told not to cache writes, or (more commonly) sits behind a battery-backed RAID controller, this guarantee could in principle be provided.
MarkR
In principle that might be true, but in reality it is certainly not the norm.
Adrian Grigore
That's a problem with the drive, then, if it can report back that it successfully wrote something when in fact it lied. Once the device driver is convinced that the data is on disk, Linux has done all it can do.
David Thornley
The whole point of having a hardware write cache is to avoid blocking the OS until the write operation has completed. The drive has no choice but to "lie". Please don't get me wrong, I am not putting the blame on Linux; I was just stating that Linux can't guarantee this, even if it makes a good effort.
Adrian Grigore
Does the driver actually lie to the OS? I have no knowledge of this area, but I always assumed that it replied "I got it" and then later says "I wrote it" when it actually does... I also figured the OS has a way to demand that it write immediately. Are these assumptions incorrect?
rmeador
The way I understand it, it's not the driver that lies to the OS, it's the hard drive itself. Have a look at http://www.jasonbrome.com/blog/archives/2004/04/03/writecache_enabled.html
Adrian Grigore
The only case where a disk should lie, saying it wrote data when it didn't, is if it has a battery-backed cache, so that there can't be data loss anyway. Otherwise, you really should get a better disk.
Nicolás
+7  A: 

From "man 2 close":

A successful close does not guarantee that the data has been successfully saved to disk, as the kernel defers writes.

The man page says that if you want to be sure that your data are on disk, you have to use fsync() yourself.
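
In other words, something like this sketch, checking both calls (out.bin is a made-up name; error handling abbreviated):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("out.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        write(fd, "bytes", 5);
        if (fsync(fd) != 0)     /* flush the kernel's buffers for this file    */
            perror("fsync");
        if (close(fd) != 0)     /* close() can report earlier deferred errors  */
            perror("close");
        return 0;
    }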

unbeknown
+2  A: 

No. fclose() doesn't imply fsync(). A lot of Linux file systems delay writes and batch them up, which improves overall performance, presumably reduces wear on the disk drive, and improves battery life on laptops. If the OS had to write to disk whenever a file was closed, many of these benefits would be lost.

Paul Tomblin mentioned a controversy in his answer, and explaining the one I've seen won't fit into a comment. Here's what I've heard:

The recent controversy is over ext4's write ordering (ext4 is the proposed successor to the popular ext3 Linux file system). It is customary, on Linux and Unix systems, to change an important file by reading the old one, writing out the new contents under a different name, and then renaming the new file over the old one. The idea is to ensure that either the new version or the old one survives, even if the system fails at some point. Unfortunately, ext4 appears to be happy to commit those steps to disk in the order read, rename, write, which can be a real problem if the system goes down between the rename and the write: the old name then points at an unwritten file, and you've lost both versions.

The standard way to deal with this is of course fsync(), but that trashes performance. The real solution is to modify ext4 to keep the ext3 ordering, under which a file would not be renamed until it had finished being written out. Apparently this isn't covered by the standard, so it's a quality-of-implementation issue, and ext4's QoI is really lousy here: there is no way to reliably write a new version of a configuration file without either calling fsync() constantly, with all the problems that causes, or risking the loss of both versions.
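
For reference, the durable version of that rewrite-and-rename pattern looks roughly like this (a sketch; config and config.tmp are made-up names, error handling omitted):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("config.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        write(fd, "new settings\n", 13);
        fsync(fd);                      /* force the new contents to disk... */
        close(fd);
        rename("config.tmp", "config"); /* ...before the rename replaces
                                           the old version                  */
        return 0;
    }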

David Thornley
Maybe that's what I'm thinking of. I should ask the people I heard ranting about it exactly what they were ranting about.
Paul Tomblin
That's the big file system rant on Slashdot recently, anyway. I suggested there that it would be equally standards-conforming for a C++ system to email your boss a nasty resignation letter, and your porn collection (/. meme) to your mom on a null pointer dereference.
David Thornley
There is an option for newer versions of Ext4 that turns on something like Ext3's data=ordered option: http://thread.gmane.org/gmane.comp.file-systems.ext4/12179
Chas. Owens
A: 

It is also important to note that fsync() does not guarantee a file is on disk; it just guarantees that the OS has asked the filesystem to flush changes to the disk. The filesystem does not have to write anything to disk.

From man 3 fsync:

If _POSIX_SYNCHRONIZED_IO is not defined, the wording relies heavily on the conformance document to tell the user what can be expected from the system. It is explicitly intended that a null implementation is permitted.

Luckily, all of the common filesystems for Linux do in fact write the changes to disk; unluckily, that still doesn't guarantee the file is on the disk. Many hard drives ship with write buffering turned on (and therefore have their own buffers that fsync() does not flush). And some drives/RAID controllers even lie to you about having flushed their buffers.
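
You can at least ask whether the platform claims to implement synchronized I/O at all; a minimal sketch, using the _POSIX_SYNCHRONIZED_IO feature-test macro from unistd.h that the quote above refers to:

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
    #if defined(_POSIX_SYNCHRONIZED_IO) && _POSIX_SYNCHRONIZED_IO > 0
        printf("Synchronized I/O is supported; fsync() must do real work\n");
    #else
        printf("fsync() may legally be a no-op on this system\n");
    #endif
        return 0;
    }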

Chas. Owens
A: 

You may also be interested in this bug report from the Firebird SQL database regarding fcntl( O_SYNC ) not working on Linux.

In addition, the question you ask implies a potential problem. What do you mean by writing to the disk? Why does it matter? Are you concerned that the power goes out and the file is missing from the drive? Why not use a UPS on the system or the SAN?

In that case you need a journaling file system - and not just a metadata journaling file system, but one with a full journal for all the data as well.

Even in that case you must understand that, besides the OS's involvement, most hard disks lie to you about having done an fsync: fsync just sends the data to the drive, and it is up to the individual operating system to know how to wait for the drive to flush its own caches.
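
For what it's worth, here is a sketch of asking for synchronous writes up front with O_SYNC (journal.log is a made-up name; whether the drive's own cache honors it is, per the above, a separate question):

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);
        write(fd, "entry\n", 6);   /* with O_SYNC, write() should not return
                                      until the data has been pushed toward
                                      stable storage                         */
        close(fd);
        return 0;
    }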

--jeffk++

jdkoftinoff