Given a range of numbers, say 1 to 10,000, with the input arriving in random order. Constraint: at any point only 1000 numbers can be loaded into memory.

Assumption: the numbers are unique.

I propose the following efficient "when-required-sort" algorithm.

We write the numbers into files, each designated to hold a particular range of numbers. For example, File1 will hold 0-999, File2 will hold 1000-1999, and so on, with the numbers inside each file in random order.

If a particular number, say 2535, is being searched for, then we know it must be in File3 (a binary search over the ranges finds the file). File3 is loaded into memory and sorted, say with quicksort (optimized to switch to insertion sort when the array is small), and then the number is located in this sorted array using binary search. When the search is done, we write the sorted file back.

So in the long run all the numbers will be sorted.
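
For concreteness, here is a rough Python sketch of the proposal (the bucket width, the file naming, and the use of Python's built-in sort in place of the quicksort step are just illustrative assumptions):

    import bisect
    import os

    BUCKET_WIDTH = 1000      # each file is designated to hold one range of 1000 numbers
    DATA_DIR = "buckets"     # illustrative location for the bucket files

    def bucket_path(n):
        # File1 holds 0-999, File2 holds 1000-1999, and so on
        return os.path.join(DATA_DIR, "file%d" % (n // BUCKET_WIDTH + 1))

    def write_numbers(numbers):
        # append each incoming number to its bucket file, in arrival order
        os.makedirs(DATA_DIR, exist_ok=True)
        for n in numbers:
            with open(bucket_path(n), "a") as f:
                f.write("%d\n" % n)

    def search(n):
        # sort the relevant bucket on demand, binary-search it, write it back
        path = bucket_path(n)
        if not os.path.exists(path):
            return False
        with open(path) as f:
            values = [int(line) for line in f]   # at most 1000 numbers in memory
        values.sort()                            # stands in for the quicksort step
        with open(path, "w") as f:               # this bucket stays sorted from now on
            f.write("\n".join(map(str, values)) + "\n")
        i = bisect.bisect_left(values, n)        # binary search within the bucket
        return i < len(values) and values[i] == n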

Please comment on this proposal.

A: 

First of all, you need to clarify your goals. You seem to be doing two different things, and we need to know which is the primary goal and which is just a nice-to-have side effect.

James Curran
+6  A: 

It's called Bucket sort.

Another approach when main memory is limited is to use Merge sort.

The part of your design where you sort each bucket on demand may be better described as "on demand", "just-in-time", or "lazy". Might as well reuse nomenclature people are already familiar with instead of inventing the term "When-required-sort".

Have you considered how to handle additional input? What happens if some of the buckets are already sorted, and then more numbers are added?

I assume the end goal is to identify if a number is included in the set, rather than to produce a sorted list. If you do this frequently there is benefit to the initial overhead of sorting a bucket. If infrequently, a linear scan of the appropriate bucket may suffice.

One more alternative. Bucket sort can be thought of as a simplistic hash table. The hash function is n/1000. Collisions are expected since there can be a large number of values hashed into each bucket (up to 1000). Instead of using on-demand sorting (and then binary search) to resolve collisions, you could use a more sophisticated hash and get O(1) search performance.
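
As a rough Python illustration of that last point (load_bucket is a hypothetical helper that reads one bucket file; everything else is assumed for the example), once the relevant bucket is in memory a hash set resolves the collisions instead of a sort plus binary search:

    def bucket_index(n, bucket_width=1000):
        # the "hash function" described above: n // 1000 picks the bucket/file
        return n // bucket_width

    def contains(n, load_bucket):
        # load_bucket(i) is a hypothetical helper returning the numbers in bucket i
        values = load_bucket(bucket_index(n))    # at most 1000 numbers in memory
        return n in set(values)                  # expected O(1) lookup within the bucket

    # tiny usage example with an in-memory stand-in for the bucket files
    buckets = {2: [2999, 2535, 2040]}
    print(contains(2535, lambda i: buckets.get(i, [])))   # True

If the same bucket is queried repeatedly, the set would of course be cached rather than rebuilt on every lookup.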

Dan
+1  A: 

The previous poster's description is correct - this is a bucket sort.

Some closely related sorts are radix sorts. These run in O(n) time but depend on a fairly uniform distribution of values within the range.

Larry Watanabe
+2  A: 

Each number can be from 1 to 10000. That means each number occupies at least 14 bits (2^13 = 8192, 2^14 = 16384).

You have the ability to load 1000 numbers into memory. That means you can use a bit mask, since you've stated that the numbers are unique. Set up a bit mask of 10,000 bits, which at 14 bits per number is the storage equivalent of at most 715 numbers (fewer if your numbers are stored with more than 14 bits each).

Initially clear the bits to indicate no numbers exist, then read the numbers one at a time, setting the relevant bit to indicate that it exists. This is an O(n) operation.

Then, once you have that bit array set up, it's an O(1) operation to see if a particular bit is set.
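
A minimal Python sketch of that bit mask (the bytearray representation and the adjustment for 1-based numbers are assumptions for illustration):

    SIZE = 10000

    present = bytearray((SIZE + 7) // 8)    # 10,000 bits = 1,250 bytes, well under
                                            # the budget of 1000 in-memory numbers

    def add(n):
        present[(n - 1) // 8] |= 1 << ((n - 1) % 8)     # numbers run from 1 to 10,000

    def contains(n):
        return bool(present[(n - 1) // 8] & (1 << ((n - 1) % 8)))

    # read the input one number at a time, setting the relevant bit: O(n)
    for n in (2535, 17, 9999):              # stand-in for the real input stream
        add(n)

    print(contains(2535), contains(42))     # True False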

Even the best sorting algorithm won't give you better than O(n) on random data.

paxdiablo
Quantum bogosort is also O(n). http://en.wikipedia.org/wiki/Bogosort#Quantum_bogosort
Roger Pate
That's fine if you're in the universe where the bogosort *worked*. What happens to all those poor souls in the countless other universes? :-)
paxdiablo
A: 

Use mergesort:
http://en.wikipedia.org/wiki/Sorting_algorithm

Mergesort's memory consumption is O(n), while bucket sort's is O(n*k).
And bucket sort's worst case is O(n^2*k), while mergesort's is O(n log n).

And note this: In almost any case where you have to sort a large number of numbers, mergesort is the most efficient sorting algorithm for the task.

Quandary
A: 

I read your question like this: "Given input of n numbers from domain D, what is the fastest way to write down those n numbers in sorted order, provided that you can store only k numbers (k < n) in memory? Provide an algorithm for n = 10000, k = 1000."

Note that in your question you say domain D is the range from 1 to 10000. I believe that is an oversimplification. With n = 10000 and the input being that whole range (no repetition), the problem becomes trivial, as you know exactly where each number should be written in the sorted file. In fact, you know exactly what the contents of that file are, so you don't have to write it at all, and you don't even have to read the input. :D

Now, if N(D) is not equal to n, or if you allow repetition, the problem becomes a bit more interesting.

If the memory is limited, I think the intuitive approach is to do this:

1st approach

Reading the input, you will be able to sort at most k1 elements at a time before writing them out, where k1 is the number of elements that can be sorted within the k elements of available memory (it depends on the memory overhead of the sort you choose).

You will end up with f = (n div k1) + 1 files which are internally sorted.

Then you will need to read from the f files and merge the partially sorted data, writing it into a final file.

Different sorts have different memory requirements and will produce a different number of partially sorted files that will have to be merged.

Merging more files will require more memory because you will not know in which file you can find the next number.
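
A minimal Python sketch of this run-then-merge approach (the temp-file handling and K = 1000 are illustrative assumptions; heapq.merge performs the f-way merge while holding only about one value per run in memory):

    import heapq
    import tempfile

    K = 1000   # at most this many numbers in memory at once

    def write_run(sorted_chunk):
        # write one internally sorted run to a temporary file
        f = tempfile.NamedTemporaryFile("w", delete=False, suffix=".run")
        f.write("\n".join(map(str, sorted_chunk)) + "\n")
        f.close()
        return f.name

    def sort_externally(numbers, out_path):
        # phase 1: write sorted runs of at most K numbers each
        runs, chunk = [], []
        for n in numbers:
            chunk.append(n)
            if len(chunk) == K:
                runs.append(write_run(sorted(chunk)))
                chunk = []
        if chunk:
            runs.append(write_run(sorted(chunk)))

        # phase 2: merge the f partially sorted files; memory grows with f, not n
        files = [open(path) for path in runs]
        try:
            with open(out_path, "w") as out:
                for n in heapq.merge(*(map(int, f) for f in files)):
                    out.write("%d\n" % n)
        finally:
            for f in files:
                f.close()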

2nd approach

Another approach is, as you suggest, to know in which file you can find the next number. It is like putting the numbers in buckets based on their value (distributing the sort by classifying), but the problem there is that unless you know how your data is distributed, it will not be easy to determine the range of each bucket.

The size of each bucket should again be k1 to keep the number of files to a minimum.

Assuming that you know something about your data's distribution, this can be done; otherwise you will need another pass over the data to establish the cutting points.
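
As a rough Python sketch of such an extra pass (the coarse-cell counting scheme and all parameter names are illustrative assumptions): count how many values fall into each small cell of the domain, then place the cutting points so that each bucket receives roughly k1 values.

    from bisect import bisect_right

    def cutting_points(numbers, lo=1, hi=10000, cells=100, k1=1000):
        # extra pass: count values per coarse cell (only `cells` counters in memory)
        width = (hi - lo + 1) // cells
        counts = [0] * cells
        for n in numbers:
            counts[min((n - lo) // width, cells - 1)] += 1
        # close a bucket whenever roughly k1 values have accumulated
        cuts, running = [], 0
        for i, c in enumerate(counts):
            running += c
            if running >= k1:
                cuts.append(lo + (i + 1) * width)   # upper bound of this bucket
                running = 0
        return cuts

    def bucket_of(n, cuts):
        return bisect_right(cuts, n)                # index of the bucket file for n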

For general data, where the size of a bucket is not known and you cannot make a first pass over all of the data (for example, if you have to maintain some kind of sorted structure as the input arrives and you don't know what will come next), you would basically have to keep an index such as a B+ tree, but this is not optimal. Indexes are optimized for fast retrieval and (some of them) for insertion of small numbers of new elements.

3rd approach

Having such a small domain allows you to simply count the numbers and write their frequencies down. If you have random access to the output files, the file system's buffering can take care of the efficiency (buffering is an approach that does efficient disk writes while limiting memory usage; the only problems are if the buffer ends up smaller than k numbers, and whether the chosen bitmap-like structure is the most efficient one).
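
A rough Python sketch of the counting idea within the memory limit (the slice-by-slice re-reading of the input is an assumed trade-off; the random-access-output variant described above would avoid the extra passes):

    def counting_sort_limited(read_input, write_output, lo=1, hi=10000, k=1000):
        # counting sort that never holds more than k counters in memory:
        # the domain is processed in slices of k values, one input pass per slice
        for start in range(lo, hi + 1, k):
            counts = [0] * k
            for n in read_input():                 # re-read the input for this slice
                if start <= n < start + k:
                    counts[n - start] += 1
            for offset, c in enumerate(counts):
                for _ in range(c):                 # c > 1 only if repetition is allowed
                    write_output(start + offset)

    # tiny usage example
    data = [3, 9999, 3, 42, 7]
    out = []
    counting_sort_limited(lambda: iter(data), out.append)
    print(out)    # [3, 3, 7, 42, 9999]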

Intuitively, I would say that the best bet would be to first calculate the distribution and determine the size and limits of each bucket, then divide the input into the buckets, and then sort each bucket. I guess some performance could be squeezed out by at least partially sorting the data while writing it into the buckets.

Unreason