Hello,

Say I have 50 million features, each of which comes from disk.

At the beginning of my program, I process each feature and, depending on some conditions, apply modifications to some of them.

At this point in my program I am reading a feature from disk, processing it, and writing it back, because I don't have enough RAM to hold all 50 million features at once.

Now say I want to sort these 50 million features. Is there an optimal algorithm to do this, given that I can't load them all into memory at the same time?

Like a partial sorting algorithm or something like that?

+6  A: 

In general, the class of algorithms you're looking for is called external sorting. Perhaps the most widely known example of such an algorithm is external merge sort.

The idea of the external version is that you split the data into pieces small enough to sort in memory (say 100,000 elements each), sort each block independently (using a standard algorithm such as quicksort), and write each sorted block back to disk. Then you repeatedly merge pairs of blocks (so two sorted 100k blocks become one sorted 200k block), which can be done by streaming elements from both blocks through small buffers, since the blocks are already sorted. In the last step you merge the two remaining large blocks into a single block containing all the elements in the right order.
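A minimal sketch of that idea in Python, assuming one newline-terminated record per line and an in-memory chunk size of 100,000; it uses a single k-way merge via `heapq.merge` instead of the pairwise merging described above, but the principle is the same:

```python
import heapq
import itertools
import os
import tempfile

CHUNK_SIZE = 100_000  # records sorted in memory at a time (assumed)

def external_sort(input_path, output_path, key=lambda line: line):
    runs = []
    # Phase 1: read fixed-size chunks, sort each one in memory, spill to temp files.
    with open(input_path) as src:
        while True:
            chunk = list(itertools.islice(src, CHUNK_SIZE))
            if not chunk:
                break
            chunk.sort(key=key)
            run = tempfile.NamedTemporaryFile("w+", delete=False)
            run.writelines(chunk)
            run.flush()
            run.seek(0)
            runs.append(run)
    # Phase 2: merge all sorted runs in one pass; heapq.merge is lazy, so only
    # one record per run is held in memory at any time.
    with open(output_path, "w") as dst:
        dst.writelines(heapq.merge(*runs, key=key))
    for run in runs:
        run.close()
        os.unlink(run.name)

external_sort("features.txt", "features_sorted.txt")
```

Each temporary run is small enough to sort comfortably in memory, and the final merge only ever keeps a handful of records in memory, so the peak memory use is bounded by the chunk size rather than the total data size.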

Tomas Petricek
@Tomas, a bit off-topic but there are two little typos in your bio: you wrote `abou` instead of `about` and `functinal` instead of `functional`.
Bart Kiers
@Bart: Fixed, thanks!
Tomas Petricek
No problemo! I tried finding the `edit` button... :)
Bart Kiers
A: 

If you are on Unix, use `sort` ;)

It may sound too simple, but the command-line tool is built to handle this case (it spills sorted runs to temporary files and merges them), so you won't have to reimplement it.
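For example, a rough sketch of delegating the job to the tool from a script; the file names, delimiter, key field, and buffer size are assumptions about the data layout:

```python
import subprocess

# GNU sort performs its own external merge sort: it writes sorted runs to the
# directory given by -T and merges them, so inputs larger than RAM are fine.
subprocess.run(
    [
        "sort",
        "-t", ",",                    # field delimiter (assumed CSV-like records)
        "-k", "1,1",                  # sort on the first field (assumed key)
        "-S", "1G",                   # in-memory buffer before spilling to disk
        "-T", "/tmp",                 # where the temporary runs are written
        "-o", "features_sorted.txt",  # output file (assumed name)
        "features.txt",               # input file (assumed name)
    ],
    check=True,
)
```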

Matthieu M.