views: 357
answers: 8

Imagine you have a library for working with some sort of XML file or configuration file. The library reads the whole file into memory and provides methods for editing the content. When you are done manipulating the content, you can call a write method to save the content back to the file. The question is how to do this in a safe way.

Overwriting the existing file (starting to write to the original file) is obviously not safe. If the write method fails before it is done, you end up with a half-written file and you have lost data.

A better option would be to write to a temporary file somewhere, and when the write method has finished, you copy the temporary file to the original file.

Now, if the copy somehow fails, you still have correctly saved data in the temporary file. And if the copy succeeds, you can remove the temporary file.
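
To make this concrete, here is roughly the scheme I have in mind as a sketch (the function and variable names are made up):

import os
import shutil
import tempfile

def save(content, original_path):
    # Write everything to a temporary file first.
    with tempfile.NamedTemporaryFile('w', delete=False) as tmp:
        tmp.write(content)
        tmp_path = tmp.name
    # Only touch the original once the temporary file is complete.
    shutil.copyfile(tmp_path, original_path)
    # The copy succeeded, so the temporary file can go.
    os.remove(tmp_path)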

On POSIX systems I guess you can use the rename system call, which is an atomic operation. But how would you best do this on a Windows system? In particular, how do you handle this best using Python?

Also, is there another scheme for safely writing to files?

+2  A: 

In the Win API I found the quite nice function ReplaceFile, which does what its name suggests, even with an optional backup. There is always the DeleteFile/MoveFile combo as well.

In general, what you want to do is really good, and I cannot think of a better write scheme.
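
A rough sketch of what the DeleteFile/MoveFile combo amounts to when done from Python with just the os module (the names here are made up):

import os

def replace_with_backup(new_path, target_path, backup_path):
    # Move the current target aside first so nothing is lost if a later
    # step fails (the DeleteFile/MoveFile idea).
    if os.path.exists(target_path):
        if os.path.exists(backup_path):
            os.remove(backup_path)           # DeleteFile
        os.rename(target_path, backup_path)  # MoveFile: target -> backup
    os.rename(new_path, target_path)         # MoveFile: new -> target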

Michal Sznajder
It would be even better if you illustrated that with the proper Python code that calls the MS library API.
RedGlyph
I didn't realize ReplaceFile existed. Reading the docs, it appears to do a lot more than just rename. It maintains many of the attributes of the replaced file, so it seems designed specifically for this purpose.
Jason R. Coombs
A: 

Perhaps this helps http://stackoverflow.com/questions/489861/locking-a-file-in-python

mtvee
I am not really concerned with multiple processes writing to the same file. But thanks for the link.
Rickard Lindberg
+3  A: 

A simplistic solution: use tempfile to create a temporary file, and if writing succeeds, just rename the file to your original configuration file.

For locking a file, see portalocker.
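
A minimal sketch of the tempfile-plus-rename idea, assuming the temporary file is created in the same directory as the target so the rename does not cross filesystems (the function name is made up):

import os
import tempfile

def atomic_write(path, data):
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, 'w') as tmp:
            tmp.write(data)
    except Exception:
        os.remove(tmp_path)
        raise
    # On POSIX this atomically replaces the target; note that on Windows
    # os.rename() fails if the target already exists.
    os.rename(tmp_path, path)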

The MYYN
If the temporary file is created on a different filesystem than the target, then the final rename either won't work or won't be atomic.
ΤΖΩΤΖΙΟΥ
But unless the rename is atomic, I risk losing data, right?
Rickard Lindberg
+2  A: 

The standard solution is this.

  1. Write a new file with a similar name. X.ext# for example.

  2. When that file has been closed (and perhaps even read and checksummed), then you do two renames:

    • X.ext (the original) to X.ext~

    • X.ext# (the new one) to X.ext

  3. (Only for the crazy paranoids) call the OS sync function to force dirty buffer writes.

At no time is anything lost or corruptible. The only glitch can happen during the renames. But you haven't lost anything or corrupted anything. The original is recoverable right up until the final rename.
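
A sketch of those steps in Python, leaving out the optional read-back/checksum and the sync call (the names are made up):

import os

def write_with_backup(path, data):
    tmp = path + '#'                   # X.ext#: the new file
    bak = path + '~'                   # X.ext~: the previous version
    with open(tmp, 'w') as f:          # step 1: write the new file
        f.write(data)
    if os.path.exists(bak):
        os.remove(bak)                 # Windows: rename won't overwrite
    if os.path.exists(path):
        os.rename(path, bak)           # step 2: X.ext  -> X.ext~
    os.rename(tmp, path)               #         X.ext# -> X.ext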

S.Lott
But if you do not rename twice (creating a backup as you did) you risk losing data if the rename is not atomic, right?
Rickard Lindberg
Rename *is* atomic in many OS's. However, the backup is more important than hand-wringing about the atomicity of the operation. Remember: the odds of a crash (except on Windows) are very small. The odds of a crash in the middle of a rename (which is only a few instructions plus a sync) are very, very small.
S.Lott
+3  A: 

If you want to be POSIXly correct and safe, you have to do the following (sketched below the list):

  1. Write to temporary file
  2. Flush and fsync the file (or fdatasync)
  3. Rename over the original file
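
A sketch of those three steps (the temporary-file suffix is made up; on Windows the final rename would fail if the target already exists):

import os

def posix_safe_write(path, data):
    tmp = path + '.tmp'
    with open(tmp, 'w') as f:
        f.write(data)
        f.flush()                 # flush Python's own buffers
        os.fsync(f.fileno())      # or os.fdatasync(f.fileno())
    os.rename(tmp, path)          # atomically replaces the original on POSIX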

Note that calling fsync has unpredictable effects on performance -- Linux on ext3 may stall on disk I/O for whole seconds as a result, depending on other outstanding I/O.

Notice that rename is not an atomic operation in POSIX -- at least not in relation to the file data, as you expect. However, most operating systems and filesystems will work this way. But it seems you missed the very large Linux discussion about Ext4 and filesystem guarantees about atomicity. I don't know exactly where to link, but here is a start: ext4 and data loss.

Notice, however, that on many systems rename will be as safe in practice as you expect. Still, it is in a way not possible to get both performance and reliability across all possible Linux configurations!

With a write to a temporary file followed by a rename of the temporary file, one would expect the operations to be dependent and to be executed in order.

The issue however is that most, if not all, filesystems separate metadata and data. A rename is only metadata. It may sound horrible to you, but filesystems value metadata over data (take journaling in HFS+ or Ext3/4 for example)! The reason is that metadata is lighter, and if the metadata is corrupt, the whole filesystem is corrupt -- the filesystem must of course preserve itself, then preserve the user's data, in that order.

Ext4 did break the rename expectation when it first came out; however, heuristics were added to resolve it. The issue is not a failed rename, but a successful rename. Ext4 might successfully register the rename but fail to write out the file data if a crash comes shortly thereafter. The result is then a 0-length file and neither the original nor the new data.

So in short, POSIX makes no such guarantee. Read the linked Ext4 article for more information!

kaizer.se
Maybe I misunderstood. But if you do a rename on a POSIX system, aren't you guaranteed that the destination is unmodified if the rename fails? "If the rename() function fails for any reason other than [EIO], any file named by new shall be unaffected." I guess it can still leave corrupt data then.
Rickard Lindberg
The issue is a successful rename, and the fact that a rename alone does not guarantee atomicity of the whole operation.
kaizer.se
What you mean is that rename() doesn't checkpoint. It most certainly is atomic, see: http://www.opengroup.org/onlinepubs/009695399/functions/rename.html
geocar
+2  A: 

If you look at Python's documentation, it clearly mentions that os.rename() is an atomic operation. So in your case, writing the data to a temporary file and then renaming it to the original file would be quite safe.

Another way could work like this:

  • let original file be abc.xml
  • create abc.xml.tmp and write new data to it
  • rename abc.xml to abc.xml.bak
  • rename abc.xml.tmp to abc.xml
  • after new abc.xml is properly put in place, remove abc.xml.bak

As you can see, you still have abc.xml.bak, which you can use to restore the original if there are any issues with the tmp file or with putting it in place.
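
A sketch of those steps (abc.xml stands for whatever the target file is; the function name is made up):

import os

def replace_config(path, data):            # path would be 'abc.xml'
    tmp = path + '.tmp'
    bak = path + '.bak'
    with open(tmp, 'w') as f:              # create abc.xml.tmp, write new data
        f.write(data)
    if os.path.exists(bak):
        os.remove(bak)                     # clear any stale backup (Windows)
    os.rename(path, bak)                   # abc.xml     -> abc.xml.bak
    os.rename(tmp, path)                   # abc.xml.tmp -> abc.xml
    os.remove(bak)                         # new abc.xml is in place, drop backup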

Shailesh Kumar
This is similar to S.Lott's answer, with the addition of deleting the backup file. It seems like this is the best way to do it. Thanks.
Rickard Lindberg
I actually saw this implemented in the way ZODB (Zope Object Database) packs its database file (Data.fs), i.e. removes the unused space of older transactions from the database file. The code is regular Python code: packing is done in a temporary file, and then steps similar to the above are carried out. ZODB has been around for many years and works well on both Windows and POSIX platforms, so I believe this approach should work.
Shailesh Kumar
Python cannot enforce the guarantee that rename is atomic. As far as I know, it just calls the OS's system call. The procedure you give works well, though.
Tommy McGuire
A: 

Per RedGlyph's suggestion, I've added an implementation of ReplaceFile that uses ctypes to access the Windows APIs. I first added this to jaraco.windows.api.filesystem.

from ctypes import windll
from ctypes.wintypes import BOOL, DWORD, LPWSTR, LPVOID

ReplaceFile = windll.kernel32.ReplaceFileW
ReplaceFile.restype = BOOL
ReplaceFile.argtypes = [
 LPWSTR,
 LPWSTR,
 LPWSTR,
 DWORD,
 LPVOID,
 LPVOID,
 ]

REPLACEFILE_WRITE_THROUGH = 0x1
REPLACEFILE_IGNORE_MERGE_ERRORS = 0x2
REPLACEFILE_IGNORE_ACL_ERRORS = 0x4

I then tested the behavior using this script.

from jaraco.windows.api.filesystem import ReplaceFile
import os

open('orig-file', 'w').write('some content')
open('replacing-file', 'w').write('new content')
ReplaceFile('orig-file', 'replacing-file', 'orig-backup', 0, 0, 0)
assert open('orig-file').read() == 'new content'
assert open('orig-backup').read() == 'some content'
assert not os.path.exists('replacing-file')

While this only works on Windows, it appears to have a lot of nice features that other replace routines would lack. See the API docs for details.

Jason R. Coombs
A: 

You could use the fileinput module to handle the backing-up and in-place writing for you:

import fileinput
for line in fileinput.input(filename, inplace=True, backup='.bak'):
    # inplace=True causes the original file to be moved to a backup;
    # standard output is redirected to the original file.
    # backup='.bak' specifies the extension for the backup file.

    # manipulate line (note: line keeps its trailing newline and print()
    # adds another, so strip it inside process())
    newline = process(line)
    print(newline)

If you need to read in the entire contents before you can write the new lines, then you can read it all first and then print the entire new contents with:

newcontents = process(contents)
for line in fileinput.input(filename, inplace=True, backup='.bak'):
    print(newcontents)
    break

If the script ends abruptly, you will still have the backup.

unutbu