I have a set of double-precision data and I need the list to be sorted at all times. What is the best algorithm to sort the data as it is being added?

By best I mean the least Big-O in the data count, then the least Big-O in the data count for the worst-case scenario, and then the least Big-O in the space needed, in that order if possible.

The set size is really variable, from a small number (30) to lots of data (10M+).

A: 

By a "set of double data," do you mean a set of real-valued numbers? One of the more commonly used algorithms for that is heapsort; I'd check that out. Most of its operations are O(n log n), which is pretty good but doesn't meet all of your criteria. The advantage of heapsort is that it's reasonably simple to code on your own, and many languages provide libraries to manage a sorted heap.

James Thompson
+2  A: 

Likely a heap sort. Heaps are only O(log N) to add new data, and you can pop off all the results in sorted order at any time in O(N log N) time.

If you always need the whole list sorted every time, then there aren't many options other than an insertion sort. It will likely be O(N^2), though with the (considerable) hassle of skip lists you can make it O(N log N).
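In C++ terms (the language other posts in this thread use), the insert-then-drain pattern can be sketched with the standard library's binary heap; `drain_sorted` is just a name made up for this sketch:

```cpp
#include <functional>
#include <queue>
#include <vector>

// Min-heap of doubles: each push is O(log n); draining n elements
// pops them smallest-first, for O(n log n) total.
std::vector<double> drain_sorted(
    std::priority_queue<double, std::vector<double>, std::greater<double>> heap) {
    std::vector<double> out;
    while (!heap.empty()) {
        out.push_back(heap.top());  // smallest remaining element
        heap.pop();
    }
    return out;
}
```

Note that the heap itself is not fully sorted at any moment; only draining it produces the sorted sequence.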

SPWorley
+23  A: 

Building a self-balancing binary tree like a red-black tree or AVL tree will allow for Θ(lg n) insertion and removal, and Θ(n) retrieval of all elements in sorted order (by doing a depth-first traversal), with Θ(n) memory usage. The implementation is somewhat complex, but they're efficient, and most languages will have library implementations, so they're a good first choice in most cases.
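As a concrete example of such a library implementation (a C++ sketch; `sorted_snapshot` is an illustrative helper name): `std::multiset` is typically backed by a red-black tree, so insertion is O(lg n) and an in-order walk of all elements is Θ(n).

```cpp
#include <set>
#include <vector>

// std::multiset keeps duplicates and stays sorted on every insert;
// iterating it front to back is an in-order traversal.
std::vector<double> sorted_snapshot(const std::multiset<double>& s) {
    return std::vector<double>(s.begin(), s.end());  // Θ(n)
}
```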

Additionally, retrieving the i-th element can be done by annotating each edge (or, equivalently, node) in the tree with the total number of nodes below it. Then one can find the i-th element in Θ(lg n) time and Θ(1) space with something like:

node *find_index(node *root, int i) {
  while (root) {
    if (i == root->left_count)
      return root;
    else if (i < root->left_count)
      root = root->left;
    else {
      i -= root->left_count + 1;
      root = root->right;
    }
  }
  return NULL; // i >= number of nodes
}

An implementation that supports this can be found in Debian's libavl; unfortunately, the maintainer's site seems to be down, but it can be retrieved from Debian's servers.

bdonlan
Yes, the right answer is some variant of tree.
Steven Sudit
or in general any self-balancing binary search tree (http://en.wikipedia.org/wiki/Self-balancing_binary_search_tree)
newacct
The B+ tree is significantly faster than the standard red-black tree for trees containing more than 16,000 entries. See: http://idlebox.net/2007/stx-btree/
lkessler
I think the question is a little general to state a specific "best" answer. We haven't got a full set of requirements for the problem yet.
NoMoreZealots
Indeed, red-black is just a good all-round solution for this problem, not necessarily the best for all cases.
bdonlan
Red-black trees retrieve in O(N log N), not O(N). The advantage of red-black trees is cheaper deletions... but that's not a cost the original question worried about.
SPWorley
Doesn't take into account pulling out a specific nth element. Doesn't address the full problem space.
NoMoreZealots
@Arno, reading a list of all items is Θ(n) by doing an in-order traversal. @Pete, Pulling out a specific nth element can be made to run in Θ(lg n) by annotating edges with the number of nodes below.
bdonlan
Yes, you are right. But if you leave that out, everything else is available on Wikipedia. The point is, that's a relevant detail that's required to make the answer something different from the textbook answer.
NoMoreZealots
@Pete, sorry, didn't see that the questioner had updated to mention the i-th element stuff. Answer updated :)
bdonlan
No problem, you have a good answer. I was just trying to point out that we need the relevant requirements first.
NoMoreZealots
+1  A: 

Check out the comparison of sorting algorithms in Wikipedia.

kek444
+3  A: 

The structure that is used for indexes of database programs is a B+ Tree. It is a balanced bucketed n-ary tree.

From Wikipedia:

For a b-order B+ tree with h levels of index:

  • The maximum number of records stored is n = b^h
  • The minimum number of keys is 2(b/2)^(h−1)
  • The space required to store the tree is O(n)
  • Inserting a record requires O(log_b(n)) operations in the worst case
  • Finding a record requires O(log_b(n)) operations in the worst case
  • Removing a (previously located) record requires O(log_b(n)) operations in the worst case
  • Performing a range query with k elements occurring within the range requires O(log_b(n) + k) operations in the worst case.

I use this in my program. You can add your data to the structure as it comes and you can always traverse it in order, front to back or back to front, or search quickly for any value. If you don't find the value, you will have the insertion point where you can add the value.

You can optimize the structure for your program by playing around with b, the size of the buckets.

An interesting presentation about B+ trees: Tree-Structured Indexes

You can get the entire code in C++.


Edit: Now I see your comment that your requirement to know the "i-th sorted element in the set" is an important one. All of a sudden, that makes many data structures less than optimal.

You are probably best off with a SortedList or even better, a SortedDictionary. See the article: Squeezing more performance from SortedList. Both structures have a GetKey function that will return the i-th element.

lkessler
+2  A: 

I would use a heap/priority queue. Worst case is same as average case for runtime. Next element can be found in O(log n) time.

Here is a templatized C# implementation that I derived from this code.

hughdbrown
+1  A: 

OK, you want your data sorted, but you need to extract it via an index number.

Start with a basic balanced tree such as the aforementioned red-black tree.

Modify the tree algorithm so that, as you insert or delete elements, every node encountered along the way keeps a count of the number of elements under each of its branches.

Then when you are extracting data from the tree, you can calculate the index as you go and know which branch to take based on whether it is greater or less than the index you are trying to extract.

One other consideration: 10M+ elements in a tree that uses dynamic memory allocation will incur a lot of memory overhead. That is, the pointers may take up more space than your actual data, plus whatever other members are used to implement the data structure. This leads to serious memory fragmentation and, in the worst cases, degrades the system's overall performance (churning data back and forth from virtual memory). You might want to consider a combination of block and dynamic memory allocation: something wherein you store the tree in blocks of data, thus reducing the memory overhead.
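A minimal C++ sketch of the count-annotated insertion described above; `os_node`, `os_insert`, and `os_kth` are hypothetical names, and this version is deliberately unbalanced. A production version would be a red-black or AVL tree that also fixes the counts during rotations.

```cpp
#include <cstddef>

// BST node carrying a count of the nodes in its left subtree.
struct os_node {
    double key;
    int left_count;                    // nodes in the left subtree
    os_node *left, *right;
    os_node(double k) : key(k), left_count(0), left(nullptr), right(nullptr) {}
};

os_node *os_insert(os_node *root, double key) {
    if (!root) return new os_node(key);
    if (key < root->key) {
        ++root->left_count;            // new element ends up to our left
        root->left = os_insert(root->left, key);
    } else {
        root->right = os_insert(root->right, key);
    }
    return root;
}

os_node *os_kth(os_node *root, int i) { // 0-indexed i-th smallest
    while (root) {
        if (i == root->left_count) return root;
        if (i < root->left_count)  root = root->left;
        else { i -= root->left_count + 1; root = root->right; }
    }
    return nullptr;                    // i out of range
}
```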

NoMoreZealots
A: 

Randomized jumplists are interesting as well. They require less space than BSTs and skip lists. Insertion and deletion are O(log n).

Luzifer42
+1  A: 

If you just need to know the i-th smallest element, as it says in the comments, use the BFPRT algorithm, which is named after the last names of its authors (Blum, Floyd, Pratt, Rivest, and Tarjan) and is generally agreed to represent the biggest concentration of big computer-science brains in a single paper. O(n) worst-case.
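If average-case O(n) is acceptable, the C++ standard library already ships a selection routine: `std::nth_element` (typically introselect rather than true BFPRT, so only the average case is linear). `ith_smallest` here is just an illustrative wrapper:

```cpp
#include <algorithm>
#include <vector>

// Selection without fully sorting: std::nth_element partitions the
// data around the i-th smallest element in O(n) average time.
double ith_smallest(std::vector<double> data, int i) {
    std::nth_element(data.begin(), data.begin() + i, data.end());
    return data[i];  // position i now holds the i-th smallest value
}
```

Note the vector is taken by value because `std::nth_element` reorders its input.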

Rui Ferreira
If what appears on Wikipedia is correct, its worst case is O(n^2). With modifications it seems to be O(n log n), the same as using a tree. But frankly, Wikipedia does a pretty bad job of describing the algorithm. Do you have a better source?
Wilhelm
The objective of the algorithm is to get the i-th smallest element at one point in time, and it takes O(n). The alternatives with sorting and trees are O(n log n): basically, you sort and then get the i-th element. A different problem is if you want to do it at every moment; then I think the best alternatives are the ones already suggested above.
Rui Ferreira
If you look on Wikipedia, BFPRT includes the "median of medians" part. I'm guessing the previous section is just an introduction to the algorithm and its relation to quicksort.
Rui Ferreira
And check MIT OpenCourseWare; I think they have a lecture on order statistics.
Rui Ferreira