tags:

views:

131

answers:

6

Is there any existing sorting algorithms that include paging?

For example. What if my records are too large to fit into memory, how would you handle something like that?

+5  A: 

Mergesort works well with paging. It's also useful for tape drives.

http://en.wikipedia.org/wiki/Merge_sort

glowcoder
+2  A: 

You can use a disk based merge sort in these scenarios.

http://en.wikipedia.org/wiki/Merge_sort#Merge_sorting_tape_drives

btreat
+1  A: 

You can sort the array of indexes, so to speak.

ruslik
A: 

A easy way would be to figure out how many items you can work with at the same time that fits in the memory. Then take that amount of item, sort them, save them. Take a new block of items sort them and insert them 1 by 1 on your save list via an insertion sort. Repeat until you sorted it all.

Another way would be to work with let say 10 block of items (the 10 blocks needs to fit in the memory) sort them all in their block. Then insert 1 by 1 the "smallest" element of all 10 block into you final list. When all 10 blocks are empty, generate 10 more. Example: 3 blocks of 3 items:

1- 8, 14, 23 2- 1, 90, 99 3- 0, 3, 4

you final list would be: 0,1,3,4,8,14,23,90,99

then you would refill those 3 blocks

Hope it help

Lobsterm
-1 sorry as this will be very slow on large datasets. It's much faster to merge 2 sorted blocks than to insert elements one at a time from one into the other (O(n) vs. O(n^2)).
j_random_hacker
+3  A: 

The term for this is External Sorting. A process of sorting and then merging is the usual approach. As others have said, mergesort is the canonical example.

ire_and_curses
+1  A: 

External sorting algorithms are extremely sensitive to the number of independent devices you have available to you. As mentioned above, mergesort is an excellent alternative; simple, transparent, easily understood, and with reasonable performance—excellent performance on a small number of drives.

If you have a large number of drives available, a distribution-sort (like quicksort, but with multiple partitions) often exhibits much better cache efficiency. If all you need is to sort a moderate amount of data, I encourage you to go with a naive mergesort. It will work well, and will be easy to implement. However if you have a lot of drives, and a large amount of data, I encourage you to read Vitter's survey article External memory algorithms and data structures: dealing with massive data Which includes an excellent introductory section on the research done into external sorting algorithms.

Oh, and don't do any of the above if running unix sort(1) will do the job.

Recurse