tags:
views: 422
answers: 9

I am trying to delete 10000+ files at once, atomically, i.e. either they all get deleted, or they all stay in place.

Of course, the obvious answer is to move all the files into a temporary directory, and delete it recursively on success, but that doubles the amount of I/O required.

Compression doesn't work, because 1) I don't know which files will need to be deleted, and 2) the files need to be edited frequently.

Is there anything out there that can help reduce the I/O cost? Any platform will do.

EDIT: let's assume a power outage can happen anytime.

+2  A: 

Instead of moving the files, make symbolic links into the temporary directory. Then if things are OK, delete the files. Or, just make a list of the files somewhere and then delete them.

Kinopiko
+5  A: 

I think what you are really looking for is the ability to have a transaction. Because the disc can only write one sector at a time, you can only delete the files one at a time. What you need is the ability to roll back the previous deletions if one of the deletes doesn't happen successfully. Tasks like this are usually reserved for databases. Whether your file system can do transactions depends on which file system and OS you are using. NTFS on Windows Vista supports Transactional NTFS. I'm not too sure how it works, but it could be useful.

Also, there is something called shadow copy on Windows, which in the Linux world is called an LVM snapshot. Basically it's a snapshot of the disc at a point in time. You could take a snapshot directly before doing the delete, and in case it isn't successful, copy the files back out of the snapshot. I've created shadow copies using WMI in VBScript; I'm sure similar APIs exist for C/C++ as well.

One thing to note about shadow copies and LVM snapshots: they work on the whole partition, so you can't take a snapshot of just a single directory. However, taking a snapshot of the whole disk takes only a couple of seconds. So you would take a snapshot, delete the files, and then, if unsuccessful, copy the files back out of the snapshot. This would be slow, but depending on how often you expect to roll back, it might be acceptable. The other option would be to restore the entire snapshot. This may or may not be good, as it would roll back all changes on the entire disk, which is bad if your OS or other important files are located there. If the partition contains only the files you want to delete, restoring the entire snapshot may be easier and quicker.

Kibbee
+7  A: 

On *nix, moving files within a single filesystem is a very low-cost operation: it works by making a hard link to the new name and then unlinking the original file. It doesn't even change any of the file times.

If you could move the files into a single directory, then you could rename that directory to get it out of the way as a truly atomic op, and then delete the files (and directory) later in a slower, non-atomic fashion.

Are you sure you don't just want a database? They all have transaction commit and rollback built-in.

DigitalRoss
Why do you think moving is cheaper than delete?
ralu
Since I don't know ahead of time which files will be deleted, this approach is still more expensive than mine by one directory rename. It would be nice, though.
Jurily
@ralu: all transaction systems do is record the intention and identify and coordinate the point of no return, prior to doing anything that can't be reversed. I was trying to call out a transactional approach to deleting files. It has more overhead than just chipping away with `unlink(2)` but that's the price of having a transaction. I should probably have said so, though.
DigitalRoss
DigitalRoss is right. Moving is far cheaper in the example given. As soon as you start unlinking a whole directory structure, you're going to flood the VFS, and lose any possibility of atomicity.
Matt Joiner
+1  A: 

I think the copy-and-then-delete method is pretty much the standard way to do this. Do you know for a fact that you can't tolerate the additional I/O?

I wouldn't count myself an expert on file systems, but I would imagine that any implementation of transactions would need to first attempt to perform all of the desired actions and then go back and commit them, i.e. you can't avoid performing more I/O than doing it non-atomically.

jthg
+1  A: 

Do you have an abstraction layer (e.g. a database) for reaching the files? (If your software goes directly to the filesystem, my proposal does not apply.)

When the conditions are right to delete the files, change their state to "deleted" in your abstraction layer and start a background job to "really" delete them from the filesystem.

Of course this proposal incurs a certain cost when opening/closing the files, but it saves you some I/O on symlink creation etc.

jldupont
+2  A: 

Couldn't you just build the list of pathnames to delete, write this list out to a file, to_be_deleted.log, make sure that file has hit the disk (fsync()), and then start doing the deletes? After all the deletes have been done, remove the to_be_deleted.log transaction log.

When you start up the application, it should check for the existence of to_be_deleted.log, and if it's there, replay the deletes in that file (ignoring "does not exist" errors).

caf
+12  A: 

Kibbee is correct: you're looking for a transaction. However, you needn't depend on either databases or special file system features if you don't want to. The essence of a transaction is this:

  1. Write out a record to a special file (often called the "log") that lists the files you are going to remove.
  2. Once this record is safely written, make sure your application acts just as if the files have actually been removed.
  3. Later on, start removing the files named in the transaction record.
  4. After all files are removed, delete the transaction record.

Note that any time after step (1), you can restart your application and it will continue removing the logically deleted files until they're finally all gone.

Please note that you shouldn't pursue this path very far: otherwise you're starting to reimplement a real transaction system. However, if you only need a very few simple transactions, the roll-your-own approach might be acceptable.

Dale Hagglund
+1: Mark for delete; stop using. Physical delete can happen at any time after that.
S.Lott
What if something happens that causes one of the files to be undeletable, like one of them being in use by another process? You could just wait for it to be freed, but that might take a while. How would you roll back in the case where not everything could be deleted?
Kibbee
This is exactly what I need. +1 for "Why didn't I think of that?"
Jurily
@Kibbee: in my case, a rollback is only necessary if I lose track of the delete process e.g. via reboot. The logfile makes sure that doesn't happen.
Jurily
One more point to note: knowing when the log record is "safely written" can be an interesting problem by itself. The usual unix answer is the fsync() system call, which guarantees that the current contents of a file are fully reflected on disk. Of course, if your disks have write caching enabled (and many consumer disks do), then you can still lose the log file in some kinds of crash. How much this matters depends on just how critical the application in question is.
Dale Hagglund
To make this work safely, you really want to interpose your own version of open(2) around the version found in libc, so that *all* attempts to open a logged file for read report the file as deleted. I'm not sure what you do on attempted opens for write; EBUSY perhaps?
Norman Ramsey
@Norman: your suggestion is sort of covered by my point (2), although I didn't have wrapping open() in mind explicitly, since that usually requires at least a little bit of linker magic. If the program in question has its own function for getting open file handles, perhaps something like "fd = lookupDataFile(filename)", then it's not clear to me that you need to wrap open().
Dale Hagglund
+1  A: 

On Windows Vista or newer, Transactional NTFS should do what you need:

#include <windows.h>
#include <ktmw32.h>   /* link against KtmW32.lib */

HANDLE txn = CreateTransaction(NULL, NULL, 0, 0, 0, 0 /* or a timeout in ms */,
                               TEXT("Deleting stuff"));
if (txn == INVALID_HANDLE_VALUE) {
  /* explode */
}
if (!DeleteFileTransacted(filename, txn)) {
  RollbackTransaction(txn); // You saw nothing.
  CloseHandle(txn);
  die_horribly();
}
if (!CommitTransaction(txn)) {
  CloseHandle(txn);
  die_horribly();
}
CloseHandle(txn);
bdonlan
+1  A: 

The basic answer to your question is "no". The more complex answer is that this requires support from the filesystem, and very few filesystems out there have that kind of support. Apparently NT has a transactional filesystem that does support this. It's possible that Btrfs for Linux will support this as well.

In the absence of direct support, I think the hard-link, move, remove option is the best you're going to get.

Omnifarious