I have been following the discussion on the "bug" in EXT4 that causes files to be zeroed on a crash if one uses the "create temp file, write temp file, rename temp to target file" process. POSIX says that unless fsync() is called, you cannot be sure the data has been flushed to the hard disk.

Obviously doing:

0) get the file contents (read it or make it somehow)
1) open original file and truncate it
2) write new contents
3) close file

is not good even with fsync(), as the computer can crash during 2) or during the fsync() and you end up with a partially written file.

Usually it has been thought that this is pretty safe:

0) get the file contents (read it or make it somehow)
1) open temp file
2) write contents to temp file
3) close temp file
4) rename temp file to original file

Unfortunately it isn't. To make it safe on EXT4 you would need to do:

0) get the file contents (read it or make it somehow)
1) open temp file
2) write contents to temp file
3) fsync()
4) close temp file
5) rename temp file to original file

This would be safe: after a crash you should have either the new file contents or the old ones, never zeroed or partial contents. But if the application uses lots of files, an fsync() after every write would be slow.
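The five steps above can be sketched in C as a single helper. This is a minimal sketch with the filenames and the helper name (`replace_file`) chosen for illustration; real code would need more careful error handling:

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Atomically replace `target` with `data`: write a temp file,
 * fsync it, then rename it over the target (steps 1-5 above). */
static int replace_file(const char *target, const char *tmp,
                        const char *data, size_t len)
{
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644); /* step 1 */
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t)len ||  /* step 2 */
        fsync(fd) != 0) {                        /* step 3 */
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) != 0) {                        /* step 4 */
        unlink(tmp);
        return -1;
    }
    return rename(tmp, target);                  /* step 5 */
}
```

Because rename() over an existing name is atomic on POSIX filesystems, a reader at any point sees either the complete old file or the complete new one.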

So my question is: how do I modify multiple files efficiently on a system where fsync() is required to be sure that changes have been saved to disk? And I really mean modifying many files, as in thousands of files. Modifying two files and doing fsync() after each wouldn't be too bad, but fsync() does slow things down considerably when modifying many files.

EDIT: changed the fsync()/close-temp-file steps to the correct order, added emphasis on writing many, many files.

A: 

You need to swap 3 & 4 in your last listing - fsync(fd) uses the file descriptor. And I don't see why that would be particularly costly - you want the data written to disk by the close() anyway, so the cost should be the same between what you want to happen and what will happen with fsync().

If the cost is too high, fdatasync(2) (where you have it) avoids syncing the metadata, so it should be cheaper.

EDIT: So I wrote some extremely hacky test code:

#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/time.h>
#include <time.h>
#include <stdio.h>
#include <string.h>

static void testBasic()
{
    int fd;
    const char* text = "This is some text";

    /* O_CREAT requires a mode argument */
    fd = open("temp.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    write(fd, text, strlen(text));
    close(fd);
    rename("temp.tmp", "temp");
}

static void testFsync()
{
    int fd;
    const char* text = "This is some text";

    fd = open("temp.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    write(fd, text, strlen(text));
    fsync(fd);
    close(fd);
    rename("temp.tmp", "temp");
}

static void testFdatasync()
{
    int fd;
    const char* text = "This is some text";

    fd = open("temp.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    write(fd, text, strlen(text));
    fdatasync(fd);
    close(fd);
    rename("temp.tmp", "temp");
}

#define ITERATIONS 10000

static void testLoop(int type)
{
    struct timeval before;
    struct timeval after;
    long seconds;
    long usec;
    int i;

    gettimeofday(&before,NULL);
    if (type == 1)
    {
        for (i = 0; i < ITERATIONS; i++)
        {
            testBasic();
        }
    }
    if (type == 2)
    {
        for (i = 0; i < ITERATIONS; i++)
        {
            testFsync();
        }
    }
    if (type == 3)
    {
        for (i = 0; i < ITERATIONS; i++)
        {
            testFdatasync();
        }
    }
    gettimeofday(&after,NULL);

    seconds = (long)(after.tv_sec - before.tv_sec);
    usec = (long)(after.tv_usec - before.tv_usec);
    if (usec < 0)
    {
        seconds--;
        usec += 1000000;
    }

    printf("%ld.%06ld\n",seconds,usec);
}

int main()
{
    testLoop(1);
    testLoop(2);
    testLoop(3);
    return 0;
}

On my laptop that produces:

0.595782
6.338329
6.116894

Which suggests that fsync() is ~10 times more expensive, and that fdatasync() is slightly cheaper.

I guess the problem I see is that every application is going to think its data is important enough to fsync(), so the performance advantages of merging writes over a minute will be eliminated.

Douglas Leeder
With the rename method one could write 100,000 config files without an fsync(), and doing 100,000 fsync() calls would be slow.
Raynet
A: 

How frequently do crashes that cause data loss occur? If more than once a decade, I suggest you have some deep-seated infrastructure problems.

anon
It seems people do have crashes, which is why the whole EXT4 problem surfaced. Not everyone has a UPS or has their laptop set up correctly to sleep when it runs out of battery.
Raynet
A: 

My own answer would be to keep making the modifications to temp files, and after finishing writing them all, do one fsync() and then rename them all.
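A sketch of this batched approach (the helper name `update_all` and the two-phase structure are my own illustration; note that fsync() is per-descriptor, so a single whole-system flush would be sync(), which on Linux waits for the writes to complete, although POSIX only requires it to schedule them):

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Batched update: write all temp files first, flush everything with
 * a single sync(), and only then rename, so a crash never exposes a
 * zero-length or partially written target file. */
static void update_all(const char *tmps[], const char *targets[],
                       const char *contents[], int n)
{
    int i;

    for (i = 0; i < n; i++) {              /* phase 1: write temp files */
        int fd = open(tmps[i], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            continue;
        write(fd, contents[i], strlen(contents[i]));
        close(fd);
    }

    sync();                                /* one flush for all of them */

    for (i = 0; i < n; i++)                /* phase 2: rename them all */
        rename(tmps[i], targets[i]);
}
```

The point is to pay the expensive flush once per batch instead of once per file.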

Raynet
fsync() is per-fd - perhaps you are thinking of sync()?
Douglas Leeder
I think it's time for a benchmark - why don't you write one, and we'll see what the impact is?
Douglas Leeder
I did a quick benchmark: the rename scenario is 10-20% slower with an fsync() for each file after the write. I would assume fsync() is the right call, as it flushes the file I just wrote; I don't want to flush anything else.
Raynet
With larger files I can get even higher slowdowns.
Raynet
This is only feasible when one process writes the many small files. That is not true in the general case.
Aaron Digulla
I think you could allow multiple processes to write files as long as you do the fsync() -> rename in a single process.
Raynet
+2  A: 

The short answer is: Solving this in the app layer is the wrong place. EXT4 must make sure that after I close the file, the data is written in a timely manner. As it is now, EXT4 "optimizes" this writing to be able to collect more write requests and burst them out in one go.

The problem is obvious: no matter what you do, you can't be sure that your data ends up on the disk. Calling fsync() manually only makes things worse: you basically get in the way of EXT4's optimizations, slowing the whole system down.

OTOH, EXT4 has all the information necessary to make an educated guess about when it is necessary to write data out to the disk. In this case, I rename the temp file to the name of an existing file. For EXT4, this means that it must either postpone the rename (so the data of the original file stays intact after a crash) or it must flush at once. Since it can't postpone the rename (the next process might want to see the new data), renaming implicitly means a flush, and that flush must happen at the FS layer, not the app layer.

EXT4 might create a virtual copy of the filesystem which contains the changes while the disk is not modified (yet). But this doesn't affect the ultimate goal: an app can't know what optimizations the FS is going to make, and therefore the FS must make sure that it does its job.

This is a case where ruthless optimizations have gone too far and ruined the results. Golden rule: Optimization must never change the end result. If you can't maintain this, you must not optimize.

As long as Tso believes that it is more important to have a fast FS than one which behaves correctly, I suggest not upgrading to EXT4 and closing all bug reports about this as "works as designed by Tso".

[EDIT] Some more thoughts on this. You could use a database instead of the file. Let's ignore the resource waste for a moment. Can anyone guarantee that the files, which the database uses, won't become corrupted by a crash? Probably. The database can write the data and call fsync() every minute or so. But then, you could do the same:

while true; do sync; sleep 60; done

Again, the bug in the FS prevents this from working in every case. Otherwise, people wouldn't be so bothered by this bug.

You could use a background config daemon like the Windows registry. The daemon would write all configs in one big file. It could call fsync() after writing everything out. Problem solved ... for your configs. Now you need to do the same for everything else your apps write: Text documents, images, whatever. I mean almost any Unix process creates a file. This is the freaking basis of the whole Unix idea!

Clearly, this is not a viable path. So the answer remains: There is no solution on your side. Keep bothering Tso and the other FS developers until they fix their bugs.

Aaron Digulla
Well, I am still looking for a solution for doing this; I don't want to depend on behaviour that has not been defined in the spec.
Raynet
Raynet, Tso has written something which doesn't work. There is nothing you can do until Tso fixes this issue.
Aaron Digulla
Perhaps, but I still would prefer to have code that works and doesn't depend on how Tso or anyone else has read the POSIX spec.
Raynet
The preferred option would probably be an addition to POSIX allowing developers to somehow mark that an operation is atomic.
Raynet
Both rename and mkdir are atomic in POSIX. But the FS guys think that doesn't include the commit. I remember that the reiserfs guys had similar problems until they started flushing every write to the log almost immediately. Later, a cleanup thread would spread the log to the disk atomically.
Aaron Digulla
Isn't this also a problem if the user has mounted an EXT3 filesystem with options other than 'data=ordered'? I really would like to write code that doesn't depend on the user happening to use the correct mount options.
Raynet
Yes yes, but they don't work like that in the real world; EXT3 is just as bad without the 'data=ordered' mount option (albeit it is the default).
Raynet
Your "while true; do sync; sleep 60; done" is what EXT4 is doing anyway - it writes out the data once a minute, by default.
Douglas Leeder
I don't know where your definition of 'behaves correctly' comes from. POSIX seems to be the only reasonable choice, and (AFAIK) EXT4 is POSIX compliant?
Douglas Leeder
@Dougly: Re "while..." In Laptop mode, the default is much bigger than a minute. And when you're a kernel developer, a minute can be too long, too.
Aaron Digulla
Doug: EXT4 is POSIX compliant, but only to the letter, not the intention. POSIX doesn't say how a FS has to behave when a crash happens at a random time.
Aaron Digulla