I have a set of double-precision data and I need the list to be sorted at all times. What is the best algorithm to sort the data as it is being added?

By best I mean the least Big-O in the data count, then the least Big-O in the data count for the worst-case scenario, and then the least Big-O in the space needed, in that order if possible.

The set size is really variable, from a small number (30) to lots of data (10M+).

A: 

By a "set of double data," do you mean a set of real-valued numbers? One of the more commonly used algorithms for that is heapsort; I'd check that out. Most of its operations are O(n log n), which is pretty good but doesn't meet all of your criteria. The advantage of heapsort is that it's reasonably simple to code on your own, and many languages provide libraries to manage a sorted heap.

James Thompson
+2  A: 

Likely a heap sort. Heaps are only O(log N) to add new data, and you can pop off all the results in sorted order at any time in O(N log N) time.

If you always need the whole list sorted every time, then there aren't many options other than an insertion sort. It will likely be O(N^2), though with the (considerable) hassle of skip lists you can make it O(N log N).
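In C++ terms (the language other posts in this thread use), the insert-then-drain pattern can be sketched with the standard library's binary heap; `drain_sorted` is just a name made up for this sketch:

```cpp
#include <functional>
#include <queue>
#include <vector>

// Min-heap of doubles: each push is O(log n); draining n elements
// pops them smallest-first, for O(n log n) total.
std::vector<double> drain_sorted(
    std::priority_queue<double, std::vector<double>, std::greater<double>> heap) {
    std::vector<double> out;
    while (!heap.empty()) {
        out.push_back(heap.top());  // smallest remaining element
        heap.pop();
    }
    return out;
}
```

Note that the heap itself is not fully sorted at any moment; only draining it produces the sorted sequence.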

SPWorley
+23  A: 

Building a self-balancing binary tree like a red-black tree or AVL tree will allow for Θ(lg n) insertion and removal, and Θ(n) retrieval of all elements in sorted order (by doing a depth-first traversal), with Θ(n) memory usage. The implementation is somewhat complex, but they're efficient, and most languages will have library implementations, so they're a good first choice in most cases.
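As a concrete example of such a library implementation (a C++ sketch; `sorted_snapshot` is an illustrative helper name): `std::multiset` is typically backed by a red-black tree, so insertion is O(lg n) and an in-order walk of all elements is Θ(n).

```cpp
#include <set>
#include <vector>

// std::multiset keeps duplicates and stays sorted on every insert;
// iterating it front to back is an in-order traversal.
std::vector<double> sorted_snapshot(const std::multiset<double>& s) {
    return std::vector<double>(s.begin(), s.end());  // Θ(n)
}
```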

Additionally, retrieving the i-th element can be done by annotating each edge (or, equivalently, node) in the tree with the total number of nodes below it. Then one can find the i-th element in Θ(lg n) time and Θ(1) space with something like:

node *find_index(node *root, int i) {
  while (root) {
    if (i == root->left_count)
      return root;
    else if (i < root->left_count)
      root = root->left;
    else {
      i -= root->left_count + 1;
      root = root->right;
    }
  }
  return NULL; // i >= number of nodes
}

An implementation that supports this can be found in Debian's libavl; unfortunately, the maintainer's site seems to be down, but it can be retrieved from Debian's servers.

bdonlan
Yes, the right answer is some variant of tree.
Steven Sudit
or in general any self-balancing binary search tree (http://en.wikipedia.org/wiki/Self-balancing_binary_search_tree)
newacct
The B+ tree is significantly faster than the standard red-black tree for trees containing more than 16,000 entries. See: http://idlebox.net/2007/stx-btree/
lkessler
I think the question is a little general to state a specific "best" answer. We haven't got a full set of requirements for the problem yet.
NoMoreZealots
Indeed, red-black is just a good all-round solution for this problem, not necessarily the best for all cases.
bdonlan
Red-black trees retrieve in O(N log N), not O(N). The advantage of red-black trees is cheaper deletions... but that's not a cost the original question worried about.
SPWorley
Doesn't take into account pulling out a specific nth element. Doesn't address the full problem space.
NoMoreZealots
@Arno, reading a list of all items is Θ(n) by doing an in-order traversal. @Pete, Pulling out a specific nth element can be made to run in Θ(lg n) by annotating edges with the number of nodes below.
bdonlan
Yes, you are right. But if you leave that out, everything else is available on Wikipedia. The point is, that's a relevant detail that's required to make the answer something different from the textbook answer.
NoMoreZealots
@Pete, sorry, didn't see that the questioner had updated to mention the i-th element stuff. Answer updated :)
bdonlan
No problem, you have a good answer. I was just trying to point out that we need the relevant requirements first.
NoMoreZealots
+1  A: 

Check out the comparison of sorting algorithms in Wikipedia.

kek444
+3  A: 

The structure that is used for indexes of database programs is a B+ Tree. It is a balanced bucketed n-ary tree.

From Wikipedia:

For a b-order B+ tree with h levels of index:

  • The maximum number of records stored is n = b^h
  • The minimum number of keys is 2(b/2)^(h−1)
  • The space required to store the tree is O(n)
  • Inserting a record requires O(log_b(n)) operations in the worst case
  • Finding a record requires O(log_b(n)) operations in the worst case
  • Removing a (previously located) record requires O(log_b(n)) operations in the worst case
  • Performing a range query with k elements occurring within the range requires O(log_b(n) + k) operations in the worst case.

I use this in my program. You can add your data to the structure as it comes and you can always traverse it in order, front to back or back to front, or search quickly for any value. If you don't find the value, you will have the insertion point where you can add the value.

You can optimize the structure for your program by playing around with b, the size of the buckets.

An interesting presentation about B+ trees: Tree-Structured Indexes

You can get the entire code in C++.


Edit: Now I see your comment that your requirement to know the "i-th sorted element in the set" is an important one. All of a sudden, that makes many data structures less than optimal.

You are probably best off with a SortedList or even better, a SortedDictionary. See the article: Squeezing more performance from SortedList. Both structures have a GetKey function that will return the i-th element.

lkessler
+2  A: 

I would use a heap/priority queue. Worst case is same as average case for runtime. Next element can be found in O(log n) time.

Here is a templatized C# implementation that I derived from this code.

hughdbrown
+1  A: 

OK, you want your data sorted, but you need to extract it via an index number.

Start with a basic balanced tree such as the aforementioned red-black tree.

Modify the tree algorithm so that, as you insert or delete elements, every node encountered along the way keeps a count of the number of elements under each of its branches.

Then when you are extracting data from the tree, you can calculate the index as you go and know which branch to take based on whether it is greater or less than the index you are trying to extract.

One other consideration: 10M+ elements in a tree that uses dynamic memory allocation will incur a lot of memory overhead. That is, the pointers may take up more space than your actual data, plus whatever other members are used to implement the data structure. This leads to serious memory fragmentation and, in the worst cases, degrades the system's overall performance (churning data back and forth from virtual memory). You might want to consider a combination of block and dynamic memory allocation: something wherein you store the tree in blocks of data, thus reducing the memory overhead.
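A minimal C++ sketch of the count-annotated insertion described above; `os_node`, `os_insert`, and `os_kth` are hypothetical names, and this version is deliberately unbalanced. A production version would be a red-black or AVL tree that also fixes the counts during rotations.

```cpp
#include <cstddef>

// BST node carrying a count of the nodes in its left subtree.
struct os_node {
    double key;
    int left_count;                    // nodes in the left subtree
    os_node *left, *right;
    os_node(double k) : key(k), left_count(0), left(nullptr), right(nullptr) {}
};

os_node *os_insert(os_node *root, double key) {
    if (!root) return new os_node(key);
    if (key < root->key) {
        ++root->left_count;            // new element ends up to our left
        root->left = os_insert(root->left, key);
    } else {
        root->right = os_insert(root->right, key);
    }
    return root;
}

os_node *os_kth(os_node *root, int i) { // 0-indexed i-th smallest
    while (root) {
        if (i == root->left_count) return root;
        if (i < root->left_count)  root = root->left;
        else { i -= root->left_count + 1; root = root->right; }
    }
    return nullptr;                    // i out of range
}
```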

NoMoreZealots
A: 

Randomized jumplists are interesting as well. They require less space than BSTs and skip lists. Insertion and deletion are O(log n).

Luzifer42
+1  A: 

If you just need to know the i-th smallest element, as it says in the comments, use the BFPRT algorithm, which is named after the last names of its authors (Blum, Floyd, Pratt, Rivest, and Tarjan) and is generally agreed to represent the biggest concentration of big computer-science brains in a single paper. O(n) worst-case.
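If average-case O(n) is acceptable, the C++ standard library already ships a selection routine: `std::nth_element` (typically introselect rather than true BFPRT, so only the average case is linear). `ith_smallest` here is just an illustrative wrapper:

```cpp
#include <algorithm>
#include <vector>

// Selection without fully sorting: std::nth_element partitions the
// data around the i-th smallest element in O(n) average time.
double ith_smallest(std::vector<double> data, int i) {
    std::nth_element(data.begin(), data.begin() + i, data.end());
    return data[i];  // position i now holds the i-th smallest value
}
```

Note the vector is taken by value because `std::nth_element` reorders its input.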

Rui Ferreira
If what appears on Wikipedia is correct, its worst case is O(n^2). With modifications it seems to be O(n log n), the same as using a tree. But frankly, Wikipedia does a pretty bad job of describing the algorithm. Do you have a better source?
Wilhelm
The objective of the algorithm is to get the i-th smallest element at one point in time, and it takes O(n). The alternatives with sorting and trees are O(n log n): basically, you sort and then get the i-th element. A different problem is if you want to do it at every moment; then I think the best alternatives are the ones already suggested above.
Rui Ferreira
If you look on Wikipedia, BFPRT includes the "median of medians" part. I'm guessing the previous section is just an introduction to the algorithm and its relation to quicksort.
Rui Ferreira
And check MIT OpenCourseWare; I think they have a lecture on order statistics.
Rui Ferreira