Given 100 GB of integer data on a hard disk and 2 GB of RAM, how would you sort the integers with the minimum number of disk operations? Here, fetching one number from disk counts as one disk operation (though in reality a block of data can be fetched at once).

We can use additional space on the disk for temporary storage, and the operations spent cleaning up that temporary space afterwards do not count.

+1  A: 

Merge sort is a popular approach when it comes to limited memory.

Mihir Mathuria
Actually no, merge sort's main disadvantage compared to quicksort and heapsort is its requirement for additional memory, roughly twice as much as the source data itself.
Gunner
Gunner, actually, no :) You are probably thinking of main-memory sorting algorithms, which is not what is being discussed here.
Dimitris Andreou
+1  A: 

100 GB of integer data means you'll have a large number of duplicates. My first instinct, if I'm trying to minimize disk I/O, would be a (bucket sort/selection) / merge sort approach.

First read a bit under 1 GB of data into memory and merge sort it in memory. Flush it to disk. Repeat for each chunk. Then you can walk each chunk of data and grab all the 0s, and repeat for each integer value. It's going to take a long time, but worst case that's only 203 GB read and 200 GB written (theoretically).
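A rough sketch of the chunk-sorting step (not the poster's code): it assumes the input is a flat binary file of native 4-byte integers, uses Python's built-in sort for the in-memory step, and the ~1 GB chunk size and file names are only illustrative.

    import array

    CHUNK_INTS = 256 * 1024 * 1024   # ~1 GB worth of 4-byte ints per in-memory chunk

    def create_sorted_runs(input_path, run_prefix):
        """Sort the input in ~1 GB chunks and flush each sorted chunk ("run") to disk."""
        runs = []
        with open(input_path, "rb") as f:
            while True:
                chunk = array.array("i")                  # native 4-byte signed ints
                try:
                    chunk.fromfile(f, CHUNK_INTS)
                except EOFError:
                    pass                                  # short final chunk is kept
                if not chunk:
                    break
                chunk = array.array("i", sorted(chunk))   # in-memory sort of one chunk
                run_path = "%s.%d" % (run_prefix, len(runs))
                with open(run_path, "wb") as out:
                    chunk.tofile(out)
                runs.append(run_path)
        return runs

The per-value scan (or merge) phase then works over these sorted runs.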

OmnipotentEntity
If you have 2GB of RAM, why only read 1GB at a time?
Gabe
Merge sort requires O(n) additional memory.
OmnipotentEntity
You're trying to minimize disk operations and have no limit on CPU operations. You can do a merge in O(1) additional space if you do your merge in O(n^2) CPU time. Personally, though, I would just read in 2GB and QuickSort it.
Gabe
That's true, and it does optimize the disk usage slightly in the selection sort phase.
OmnipotentEntity
I think the RAM-based sorting doesn't matter much. We can use heapsort, which needs only a constant amount of additional memory and has none of quicksort's worst-case behavior. I'd choose heapsort because it's in-place.
Gunner
Wait, quicksort can be performed in-place too.
Dimitris Andreou
I would go with a pure merge sort: O(1) main-memory space (streaming the sorted output to secondary memory), so the whole memory can be used for input/output buffering (reading/writing in big chunks). I don't understand where the O(n^2) Gabe mentions comes from; merging is O(n).
Dimitris Andreou
By the way, reading in ~2 GB chunks and sorting them via quicksort sounds like a nice way to create the initial sorted runs (and then continue via merge sort), much better than building runs of 2, 4, 8, etc. until you reach 2 GB worth of sorted data.
Dimitris Andreou
+2  A: 

I think that for a fast algorithm, another 100 GB of free HDD space is a precondition.

Just use any sort on 2 GB chunks and write them back. Now you have 50 sorted chunks on disk, and you can use the merge sort suggested by Mihir on them, writing the output buffer to the output file as it fills. You'll just have to fine-tune the input and output buffer sizes.
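A minimal sketch of that merge phase (again not from the answer itself): it assumes the 50 sorted runs were written as flat binary files of native 4-byte integers, and the buffer sizes are exactly the knobs that need tuning; the values here are placeholders.

    import array, heapq

    READ_BUF_INTS = 4 * 1024 * 1024     # ints read from each run at a time (input buffer)
    WRITE_BUF_INTS = 16 * 1024 * 1024   # ints collected before flushing (output buffer)

    def iter_run(path):
        """Yield the integers of one sorted run, reading it in large blocks."""
        with open(path, "rb") as f:
            while True:
                buf = array.array("i")
                try:
                    buf.fromfile(f, READ_BUF_INTS)
                except EOFError:
                    pass
                if not buf:
                    return
                for value in buf:
                    yield value

    def merge_runs(run_paths, output_path):
        """50-way merge of the sorted runs into one sorted output file."""
        out_buf = array.array("i")
        with open(output_path, "wb") as out:
            for value in heapq.merge(*(iter_run(p) for p in run_paths)):
                out_buf.append(value)
                if len(out_buf) >= WRITE_BUF_INTS:
                    out_buf.tofile(out)
                    out_buf = array.array("i")
            out_buf.tofile(out)          # flush whatever is left in the output buffer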

There are also solutions based on counting, but counting cannot be used over such a large range with such a large maximum possible count. You could only store the QWORD counters on disk, and that means many random accesses, which will certainly be slower than working with large buffers.

ruslik
The first answer that came to my mind was actually this one. But now it seems that counter-based solutions are more convenient.
Gunner
@Gunner but how?
ruslik
Look at the post by Mark Synowiec.
Gunner
If you split the input into 50 equal size pieces, Merge Sort is a good answer. You get 400 GB total I/O - 100 GB to read the input, 100 GB to write the 50 files, 100 GB to read all 50 files back in, and 100 GB to produce the output.
Mark Ransom
@Gunner, since you never specified the size of the integers it seems premature to say a counting solution would be practical. Even if it is, the worst case performance will be dismal.
Mark Ransom
+1  A: 

To me the answer to this question depends crucially on the expected distribution of the numbers in the file.

There are roughly 27 billion 4-byte ints in 100 GB of int data, but only ~4.3 billion distinct 32-bit values.

Given a perfectly uniform distribution across all possible ints, you'd expect each int to show up roughly 6 times, give or take. This low level of duplication does not warrant changing from a standard sort routine (one that sorts chunks at a time and then merges the chunks together).

However, if we restrict the "file ints" to all be non-negative, then we immediately expect each valid int to appear roughly 12 times. This is approaching a level of duplication that may warrant a change of sorting routine. So I think you should ask the interviewer whether anything more can be assumed about the distribution of the ints on disk. After all, it would be odd to have 100 GB worth of data and no idea whether it exhibits any predictable pattern.
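For concreteness, the expectations above fall out of this arithmetic (assuming 4-byte ints):

    total_ints      = 100 * 2**30 // 4   # ~26.8e9 ints in 100 GB
    distinct_all    = 2**32              # ~4.29e9 possible 32-bit values
    distinct_nonneg = 2**31              # ~2.15e9 non-negative values

    print(total_ints / distinct_all)     # ~6.25 expected copies of each value
    print(total_ints / distinct_nonneg)  # ~12.5 expected copies if all values are non-negative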

Ivan
This is actually an interview question, and the interviewer is probably interested in seeing how one approaches the problem, so we should not expect any particular pattern in the data. On that note, I wonder if there is ever a need to sort such a huge amount of data in real life.
Gunner
Yeah, I get that you wrote out the actual interview question. But you should ask in the interview whether the numbers in the file come from one distribution or another, because having that knowledge (or not) has dramatic implications -- you should show that you realize that.
Ivan
@Ivan: Good point, most interviewers actually expect the interviewee to ask a few clarifying questions. Those say a lot about how the person thinks and deals with the problems presented.
Gunner
I thought about that, but counting is only good when the range is restricted. The solution should be able to handle any input, and even a DWORD isn't big enough to store the maximum possible count.
ruslik
+2  A: 

Since the data being sorted is of integer type (4 bytes) and the total amount is 100 GB (where GB is 2^30 bytes), you have 26,843,545,600 integers to sort. Since there are 4,294,967,296 possible integer values, you can represent the data as an array of longs that serve as counters, which would consume about 34 GB of disk space. Read through the 100 GB of data once, incrementing the counter for each integer value you see (300 GB of total disk access: read the value, read its counter, write the incremented counter), then read the counters in order, writing out each value as many times as it was counted (134 GB of total disk access).

This would sort the data using a total of 434 GB of disk access. If you use RAM to hold part of the range of integer counters, you could technically lower the amount of disk access even further.
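A literal sketch of this scheme (treating the values as unsigned 32-bit keys so they index the counter file directly; a real implementation would batch these accesses and cache hot counters in RAM, as the comments below discuss, and the file names here are only illustrative):

    import struct

    COUNTER_BYTES = 8     # 64-bit counters: a single value can occur ~27e9 times

    def counting_sort_on_disk(input_path, counter_path, output_path):
        # Pre-size the counter file: 2**32 counters of 8 bytes each (~32 GB, sparse where supported).
        with open(counter_path, "wb") as c:
            c.truncate(2**32 * COUNTER_BYTES)

        # Pass 1: read each value, then read / increment / write back its counter.
        with open(input_path, "rb") as f, open(counter_path, "r+b") as c:
            while True:
                raw = f.read(4)
                if len(raw) < 4:
                    break
                value = struct.unpack("<I", raw)[0]
                c.seek(value * COUNTER_BYTES)
                count = struct.unpack("<Q", c.read(COUNTER_BYTES))[0]
                c.seek(value * COUNTER_BYTES)
                c.write(struct.pack("<Q", count + 1))

        # Pass 2: walk the counters in order and emit each value as many times as it was counted.
        with open(counter_path, "rb") as c, open(output_path, "wb") as out:
            for value in range(2**32):
                count = struct.unpack("<Q", c.read(COUNTER_BYTES))[0]
                if count:
                    four = struct.pack("<I", value)
                    for _ in range(count):
                        out.write(four)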

Mark Synowiec
This looks to be a nice answer. Read all the elements, count them, and write the results back. I guess counting sort is the way to go if we want to reduce disk access.
Gunner
There are 2^32 32-bit integers, and 8 bytes in a long, so it would take *exactly* 32GB (where GB is 2^30) to store all the counters. However, each counter requires only 35 bits to store up to 26,843,545,600, so you need 2^32*35/8 bytes, or under 18GB to hold the counters. Furthermore, you can use your 2GB of RAM to cache frequently used counters, reducing your disk usage even more.
Gabe
@Gabe: Yes, keeping some values in memory would also improve the performance. Another possibility could be to actually keep the counters in memory until we reach a point where we won't be able to accommodate any more. In that case, we flush these and update the counters on the disk.
Gunner
I'm afraid that in this case disk access has to be measured as the number of accesses and not as traffic (although the traffic would also be huge). HDDs have horrible seek times, and SSDs aren't yet in wide use (and they have poor write speeds).
ruslik
I think this would be terribly slow if you have to read/write the HD for every single integer. As Gabe said, caching would improve results, but with only 2 GB of RAM you would only halve the number of reads/writes to the HD (~5e10). I think that keeping a 2 GB array of int64 counters in RAM (since we could potentially see the same number 26e9 times) and sweeping through the HD 100 times, ignoring the integers that are not within the current array bounds, would be faster. Of course, if you know something about the distribution of the integers you could improve further.
Dijkstra
@Dijkstra Normally read functions have buffering built in, so sequentially reading single integers won't create any issues. Reading all 100 GB 100 times, however, would be much slower than reading it once using this algorithm.
Mark Synowiec
@Mark It's not the reading of the integers that will slow you down; it's incrementing the counters that will be slow. Imagine you read the integers 0, 1<<31, 0. Each of those will require a HD write to increment, since you can't store all your counters in 2 GB of RAM at the same time.
Dijkstra
I was about to say what "Dijkstra" said. This is terribly slow. Estimating I/O overhead just by the size of the data accessed is completely bogus. Random accesses are slow, whereas sequential scanning (of large chunks) is *very* efficient. This is why mergesort is king in secondary-memory sorting; the heavy lifting is done purely by sequential reads and writes.
Dimitris Andreou
Sure, but the question stated a simplified view of I/O overhead ("fetching one number from disk is considered as one disk operation"). I just assumed the simplification could be applied to writing as well. Plus, the question doesn't say anything about the speed of disk access, only about minimizing disk operations. I do agree, though, that in a real-world scenario, utilizing RAM to minimize writes would be necessary.
Mark Synowiec
In a real world scenario, random I/O is MUCH more expensive than batch I/O, and caching will help but not in the worst case (you can force every increment to be a cache miss).
Ssancho
+2  A: 

As other people have noted, you can use an O(n) counting sort. However, there are some additional problems you need to consider. We'll assume you're storing 32-bit integers, so 100 GB ~ 27e9 ints.

If all the integers are the same, then that value will occur ~27e9 times, which is more than a 32-bit int can hold. Therefore your counters will have to be 64-bit integers.

With 2 GB of RAM, you can only store ~250e6 64-bit counters in RAM at any one time. If we can't make any assumptions about the distribution of the integers, we would either have to:

  • individually increment the counters on the HD, or
  • ignore all integers we encounter that are not in the counter array we currently have stored in RAM.

I think the latter option is better. Since we need ~4e9 64-bit counters and can only hold 2 GB of them at once, we would need to run through the entire array ~16 times. The first option is clearly no good if we consider encountering a sequence of integers such as 0, 1<<31, 0: those counters would not be stored in RAM at the same time, so at least 2 HD writes would be required.
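A rough sketch of that latter option (not from the original answer: values are treated as unsigned 32-bit keys, a little-endian, whole-number-of-ints input file is assumed, and the pass geometry mirrors the ~16 passes estimated above):

    import array, struct

    PASSES = 16
    RANGE_PER_PASS = 2**32 // PASSES     # ~268e6 counters per pass; at 8 bytes each that is ~2 GB of RAM

    def multipass_counting_sort(input_path, output_path):
        with open(output_path, "wb") as out:
            for p in range(PASSES):
                lo = p * RANGE_PER_PASS
                hi = lo + RANGE_PER_PASS
                counts = array.array("Q", [0]) * RANGE_PER_PASS   # 64-bit counters for this sub-range
                with open(input_path, "rb") as f:                 # one full pass over the input
                    while True:
                        raw = f.read(4 * 1024 * 1024)             # read in large blocks
                        if not raw:
                            break
                        for value in array.array("I", raw):       # assumes whole 4-byte ints
                            if lo <= value < hi:                  # ignore values outside this pass's range
                                counts[value - lo] += 1
                for i, count in enumerate(counts):                # emit this sub-range in sorted order
                    if count:
                        four = struct.pack("<I", lo + i)
                        for _ in range(count):
                            out.write(four)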

Because of this, I think that for the specific size of your problem (100 GB), an N-way merge sort would be better, as this would only require reading the entire array about log_2(100) ≈ 7 times.

However, if the interviewer immediately changed the question to "10TB array, still 2GB of RAM", then the counting sort would easily win.

Dijkstra