views: 295
answers: 7

I'm trying to work out how to efficiently sort a huge dataset that won't fit in memory. The obvious answer at a high level is to sort a whole bunch of chunks that do fit in memory using some standard algorithm, write these out to disk, and then merge them. Merging them is the problem.

Let's say the data divides up into C chunks, so I have C files to merge. If I do a C-way merge in one pass, then technically I have an O(N^2) algorithm, though one that only has to perform O(N) writes to disk. If I iteratively merge them into C/2 files, then C/4 files, etc. then I have an O(N log N) algorithm, but one that has to perform O(N log N) writes to disk, and therefore has a huge constant term.

What is the typical solution to this conundrum? Is there any good one?

+1  A: 

Why not look at the problem from a different perspective? For example, if you happen to be sorting names, make one pass sorting everything that begins with A-F, a second pass sorting strings that begin with G-M, and so on. Then the results can simply be appended in order. The disadvantage is that the data must be read from disk C times.
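
A minimal Python sketch of that idea, assuming the records are text lines keyed on their first letter and that each key range is chosen small enough for its bucket to fit in memory (the function name and the ranges are purely illustrative):

    def partition_sort(input_path, output_path, ranges):
        # ranges: list of (low, high) first-letter bounds, e.g. [("A", "F"), ("G", "M")].
        # Note: lines whose first letter falls outside every range would be dropped,
        # so the ranges must cover the whole key space.
        with open(output_path, "w") as out:
            for low, high in ranges:                  # one full read pass per key range
                with open(input_path) as src:
                    bucket = [line for line in src
                              if low <= line[:1].upper() <= high]
                bucket.sort()                         # each bucket is assumed to fit in memory
                out.writelines(bucket)                # buckets emerge in key order, so just append

    # Example: four read passes over the data, one sequential write of the result.
    # partition_sort("names.txt", "sorted.txt",
    #                [("A", "F"), ("G", "M"), ("N", "S"), ("T", "Z")])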

zildjohn01
This is an interesting idea. Given that disk read is so much faster than disk write, I wonder how it'd compare to the classic algorithms.
Mark Bessey
+1  A: 

Why aren't you using the algorithms in http://www.amazon.com/Art-Computer-Programming-Sorting-Searching/dp/0201896850 ?

They're quite good, and carefully explained.

S.Lott
I'm not sure you can assume every poster on SO has the same books on their bookshelf that you do! Is there a particular algorithm you want to recommend? Can you give perhaps a hint of how it applies to this particular issue?
Peter Recore
@Peter: My point was a bit more general. If you're tackling sorting, you simply *must* buy this book.
S.Lott
+3  A: 

A good solution is external sorting. Specifically, check out the external mergesort algorithm.

External sorting is a term for a class of sorting algorithms that can handle massive amounts of data. External sorting is required when the data being sorted does not fit into the main memory of a computing device (usually RAM) and instead must reside in slower external memory (usually a hard drive). The typical external sorting algorithm uses a sort-merge strategy, which starts by sorting small subfiles. The basic algorithm consists of two phases: the sorting phase and the merging phase. In the sorting phase, subfiles that fit in the available buffer space are read into main memory, sorted using an internal sorting algorithm, and written back to disk as temporary sorted subfiles. In the merging phase, the sorted subfiles are merged during one or more passes.
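
To make the two phases concrete, here is a compact Python sketch, assuming line-oriented records and a chunk_size picked so a chunk comfortably fits in memory; the names and the single-pass merge via heapq.merge are my choices for illustration, not part of any particular implementation:

    import heapq
    import os
    import tempfile
    from itertools import islice

    def external_sort(input_path, output_path, chunk_size=1_000_000):
        run_paths = []
        # Sorting phase: read memory-sized chunks, sort each one, write it out as a temporary run.
        with open(input_path) as f:
            while True:
                chunk = list(islice(f, chunk_size))
                if not chunk:
                    break
                chunk.sort()
                fd, path = tempfile.mkstemp(text=True)
                with os.fdopen(fd, "w") as run:
                    run.writelines(chunk)
                run_paths.append(path)
        # Merging phase: stream every sorted run through one lazy k-way merge.
        runs = [open(p) for p in run_paths]
        with open(output_path, "w") as out:
            out.writelines(heapq.merge(*runs))
        for r in runs:
            r.close()
        for p in run_paths:
            os.remove(p)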

Nick D
His question was how to do an external sort (though he apparently didn't know that name). Answering that he should use external sorting might give him a starting point for Googling, but seems (at least to me) to fall somewhat short of deserving even a single up-vote.
Jerry Coffin
@Jerry Coffin, I posted that wikipedia entry because it describes the external mergesort algorithm.
Nick D
+1  A: 

Nick is right, use external sorting. Your C-way merge doesn't imply O(N^2), by the way. Use a priority queue for the merge and it is still O(N lg N).

You might also look at cache oblivious algorithms for sorting.
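
As an illustration of the priority-queue merge, a sketch along these lines (run_files are assumed to be already-open handles to the C sorted run files) writes each record after a single O(log C) heap operation, so the whole merge costs O(N log C) comparisons while still doing only O(N) writes:

    import heapq

    def kway_merge(run_files, out_file):
        heap = []
        for i, f in enumerate(run_files):          # seed the heap with each run's first record
            line = f.readline()
            if line:
                heapq.heappush(heap, (line, i))
        while heap:
            line, i = heapq.heappop(heap)          # smallest of the C current heads
            out_file.write(line)
            nxt = run_files[i].readline()          # refill from the run that record came from
            if nxt:
                heapq.heappush(heap, (nxt, i))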

Keith Randall
+5  A: 

The simple answer is that there is no simple answer to this question. There are lots of answers, most of them fairly complex -- Knuth volume 3 (for one example) devotes a great deal of space to it.

One thing that becomes obvious when looking through what's been done is that you really want to minimize the number of files you create during your initial sorting, and maximize the length of each. To do that, you generally want to read in about as much data as you can fit in memory, but instead of just sorting it and writing it out, you put it into a heap. Then, each time you write the smallest record from the heap out to the current file, you read IN another record. If the new record is at least as large as the one you just wrote, it can still go into the current file, so you insert it into the heap; if it's smaller, it will have to wait for a later file, so you insert it into a second heap instead. You stop putting records into the current file when the first heap is completely empty and your second heap is taking up all your memory. At that point, you start putting records into a new file, and basically "swap" the uses of the two heaps.

This will produce considerably longer intermediate files in the initial phase, so merging them is substantially less work.
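
Here is a hedged sketch of that two-heap scheme (replacement selection) using Python's heapq, treating "memory" as a fixed number of records and leaving the actual run files to a callback; the names are illustrative, not from any particular implementation:

    import heapq

    def replacement_selection(records, memory_size, write_run):
        # records: iterator of comparable items; write_run: callback receiving each finished run.
        it = iter(records)
        heap = [rec for _, rec in zip(range(memory_size), it)]   # fill available "memory"
        heapq.heapify(heap)
        run, pending = [], []      # pending holds records that must wait for the next run
        while heap:
            smallest = heapq.heappop(heap)
            run.append(smallest)   # in a real sort this record would be written to the run file
            nxt = next(it, None)
            if nxt is not None:
                if nxt >= smallest:
                    heapq.heappush(heap, nxt)   # still usable in the current run
                else:
                    pending.append(nxt)         # smaller than what we just wrote: next run
            if not heap:           # current run finished: swap the heaps and start a new run
                write_run(run)
                run = []
                heap = pending
                heapq.heapify(heap)
                pending = []

On randomly ordered input this tends to produce runs roughly twice the size of available memory, which is exactly why the subsequent merge has fewer, longer files to work with.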

Edit: I certainly didn't invent this -- I probably first read about it in Knuth, but perhaps in Algorithms + Data Structures = Programs (Niklaus Wirth) -- both discuss it. Knuth credits the first publication of the method to "H. Seward", in his master's thesis at MIT in 1954. If you have the second edition of Knuth, it's on page 254 of volume 3. I've never gotten around to getting a copy of the third edition, so I don't have a page number for that.

Jerry Coffin
Sounds like a very good solution. It's worth mentioning that the heap you are referring to is the data structure described in http://en.wikipedia.org/wiki/Heap_%28data_structure%29 and not the heap used in, e.g., C for dynamic memory allocation. Also, it would be nice to know the origin of the algorithm - is it your own invention?
gooli
+1  A: 

It's funny, as I heard this very same question not a month ago... along with the response that our local guru gave.

"Use the unix sort command"

Though we admittedly thought it was a joke at the expense of the asker... it turns out that it was not. The reasoning is that those smart guys already gave a lot of thought to how to solve the problem of sorting very large files, and came up with a very impressive implementation which makes good use of the available resources.

Therefore, unless you plan on re-inventing the wheel (i.e. you have the time and this is business critical), simply using the unix sort is probably an excellent idea.

The only drawback is its arcane syntax; there are whole pages dedicated to the command and its various options.

My personal advice: take a small sample of the data and test that the command effectively does exactly what you want.
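
As a rough illustration of that advice, a test run driven from a script might look like the sketch below; the flags are standard GNU sort options, but the paths, sample file name, and buffer size are made-up examples:

    import subprocess

    subprocess.run(
        ["sort",
         "-S", "2G",            # let sort use ~2 GB of memory for its in-memory runs
         "-T", "/mnt/scratch",  # put the temporary run files on a disk with room to spare
         "-o", "sorted.txt",    # where the merged result goes
         "sample.txt"],         # the small sample being tested
        check=True,             # raise if sort exits with an error
    )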

Matthieu M.
A: 

Are you sorting in place or creating a new copy? If you are sorting in place, then memory-mapped IO is usually a good option. Just map your entire file and perform a merge sort on it. The OS will keep as much of the file in memory as it can and, depending on the data set, will generally minimize your IO.
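
A minimal sketch of that memory-mapped approach, under the assumption that the file is a flat array of fixed-size binary records (64-bit integers here) and that NumPy is available; NumPy's in-place sort on the mapping stands in for the merge sort mentioned above:

    import numpy as np

    def sort_in_place(path):
        data = np.memmap(path, dtype=np.int64, mode="r+")   # map the whole file read-write
        data.sort()                                          # sort directly in the mapping
        data.flush()                                         # push dirty pages back to disk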

If you do write your own sorting algorithm, one trick is to reverse your direction after each pass. So, if on your first pass you go from beginning to end, go from end to beginning on your second pass. If you split your file into parts A, B, C, and D, then after sorting C and D you should merge C and D, and not go back to A and B. The reason, of course, is that your OS will page parts of the files into memory, and you want to use that cache as much as possible.

brianegge