views:

208

answers:

6

I need to write a sorting program in C and it would be nice if the file could be sorted in place to save disk space. The data is valuable, so I need to ensure that if the process is interrupted (ctrl-c) the file is not corrupted. I can guarantee the power cord on the machine will not be yanked.

Extra details: file is ~40GB, records are 128-bit, machine is 64-bit, OS is POSIX

Any hints on accomplishing this, or notes in general?

Thanks!

To clarify: I expect the user will want to ctrl-c the process. In this case, I want to exit gracefully and ensure that the data is safe. So this question is about handling interrupts and choosing a sort algorithm that can wrap up quickly if requested.

+3  A: 

Use heap sort, and prevent interruptions (e.g. block signals) during each swap operation.

R..
Yeah. Simple, and easy. Also make changes in memory and then write/flush them all at once (block signals for this part).
Matt Joiner
Heapsort requires hitting all over the file, which may not be good for anything that doesn't all fit into physical memory.
David Thornley
On the other hand, if the file is giant like in your case, big-O and not the constant factor from io cache misses is going to dominate. Heap sort is the only in-place sort with better than quadratic big-O.
R..
A: 

Backup whatever you plan to change. The put a flag that marks a successful sort. If everything is OK then keep the result, otherwise restore backup.

Victor Hurdugaci
-1 does not meet the in-place criterion of the question, which clearly stated there's a huge dataset and perhaps no space to back it up.
R..
R: You don't need to save the whole file. Just save small pieces that are used in sort. I never said that the whole file must be backed up.
Victor Hurdugaci
+11  A: 

Jerry's right, if it's just Ctrl-C you're worried about, you can ignore SIGINT for periods at a time. If you want to be proof against process death in general, you need some sort of minimal journalling. In order to swap two elements:

1) Add a record to a control structure at the end of the file or in a separate file, indicating which two elements of the file you are going to swap, A and B.

2) Copy A to the scratch space, record that you've done so, flush.

3) Copy B over A, then record in the scratch space that you have done so, flush

4) Copy from the scratch space over B.

5) Remove the record.

This is O(1) extra space for all practical purposes, so still counts as in-place under most definitions. In theory recording an index is O(log n) if n can be arbitrarily large: in reality it's a very small log n, and reasonable hardware / running time bounds it above at 64.

In all cases when I say "flush", I mean commit the changes "far enough". Sometimes your basic flush operation only flushes buffers within the process, but it doesn't actually sync the physical medium, because it doesn't flush buffers all the way through the OS/device driver/hardware levels. That's sufficient when all you're worried about is process death, but if you're worried about abrupt media dismounts then you'd have to flush past the driver. If you were worried about power failure, you'd have to sync the hardware, but you're not. With a UPS or if you think power cuts are so rare you don't mind losing data, that's fine.

On startup, check the scratch space for any "swap-in-progress" records. If you find one, work out how far you got and complete the swap from there to get the data back into a sound state. Then start your sort over again.

Obviously there's a performance issue here, since you're doing twice as much writing of records as before, and flushes/syncs may be astonishingly expensive. In practice your in-place sort might have some compound moving-stuff operations, involving many swaps, but which you can optimise to avoid every element hitting the scratch space. You just have to make sure that before you overwrite any data, you have a copy of it safe somewhere and a record of where that copy should go in order to get your file back to a state where it contains exactly one copy of each element.

Jerry's also right that true in-place sorting is too difficult and slow for most practical purposes. If you can spare some linear fraction of the original file size as scratch space, you'll have a much better time of it with a merge sort.

Based on your clarification, you wouldn't need any flush operations even with an in-place sort. You need scratch space in memory that works the same way, and that your SIGINT handler can access in order to get the data safe before exiting, rather than restoring on startup after an abnormal exit, and you need to access that memory in a signal-safe way (which technically means using a sig_atomic_t to flag which changes have been made). Even so, you're probably better off with a mergesort than a true in-place sort.

Steve Jessop
I'd think about using a memory mapped file for the scratch space. It should give you the best of both worlds. 1) Low overhead for accesses because it's cached and no API callls 2) If the process dies the OS should still flush the modified memory to disk regardless of how the process dies.
torak
@torak: Great news, I didn't know POSIX provided that guarantee for mmap. I still have a worry that accesses to the file could be reordered, though, making it unreliable for recovery. So all the pointers might need to be to `volatile` or something.
Steve Jessop
You would need to use `msync()` with `MS_SYNC` at each flush point. `msync()` also ought to imply a compiler barrier, so `volatile` shouldn't be necessary.
caf
+5  A: 

The part for protecting against ctrl-c is pretty easy: signal(SIGINT, SIG_IGN);.

As far as the sorting itself goes, a merge sort generally works well for external sorting. The basic idea is to read as many records into memory as you can, sort them, then write them back out to disk. By far the easiest way to handle this is to write each run to a separate file on disk. Then you merge those back together -- read the first record from each run into memory, and write the smallest of those out to the original file; read another record from the run that supplied that record, and repeat until done. The last phase is the only time you're modifying the original file, so it's the only time you really need to assure against interruptions and such.

Another possibility is to use a selection sort. The bad point is that the sorting itself is quite slow. The good point is that it's pretty easy to write it to survive almost anything, without using much extra space. The general idea is pretty simple: find the smallest record in the file, and swap that into the first spot. Then find the smallest record of what's left, and swap that into the second spot, and so on until done. The good point of this is that journaling is trivial: before you do a swap, you record the values of the two records you're going to swap. Since the sort runs from the first record to the last, the only other thing you need to track is how many records are already sorted at any given time.

Jerry Coffin
+1 - doesn't fit the the in-place demand, but it's IMHO the sanest approach. Doesn't modify master file while sorting chunks, and if the merge process crashes you've got the sorted chunk-files as backup.
snemarch
Instead of ignoring `SIGINT`, you should **block** it (and all other signals) during swap operations, but periodically unblock it so that pending signals which arrived while blocked can be processed.
R..
+3  A: 

Install a handler for SIGINT that just sets a "process should exit soon" flag.

In your sort, check the flag after every swap of two records (or after every N swaps). If the flag is set, bail out.

caf
This sounds like the best solution to me. Some of the other recommendations give me the impression that Ctrl-C is to be ignored if it isn't pressed at just the right time, which seems like a poor user-experience to me.
Jeffrey L Whitledge
This looks like the winner. It appears to apply to quicksort and heapsort. Thanks!
Full Decent
It won't save your data from kill -9 or a power failure though.
Zan Lynx
A: 

Assuming a 64-bit OS (you said it is a 64bit machine but could still be running 32bit OS), you could use mmap to map the file to an array then use qsort on the array.

Add a handler for SIGINT to call msync and munmap to allow the app to respond to Ctrl-C without losing data.

Martin Thomas
-1 for incorrect answer. If a signal interrupts `qsort`, the data is in an indeterminate state until `qsort` finishes. There is absolutely no way to use the system's `qsort` to meet the OP's needs.
R..