views:

379

answers:

5

We all know there are plenty of self-balancing binary search trees (BST), being the most famous the Red-Black and the AVL. It might be useful to take a look at AA-trees and scapegoat trees too.

I want to do deletions insertions and searches, like any other BST. However, it will be common to delete all values in a given range, or deleting whole subtrees. So:

  1. I want to insert, search, remove values in O(log n) (balanced tree).
  2. I would like to delete a subtree, keeping the whole tree balanced, in O(log n) (worst-case or amortized)
  3. It might be useful to delete several values in a row, before balancing the tree
  4. I will most often insert 2 values at once, however this is not a rule (just a tip in case there is a tree data structure that takes this into account)

Is there a variant of AVL or RB that helps me on this? Scapegoat-trees look more like this, but would also need some changes, anyone who has got experience on them can share some thougts?

More precisely, which balancing procedure and/or removal procedure would help me keep this actions time-efficient?

+2  A: 

Hmm, what about B-trees? They are also balanced, and if you choose a big-order one --- it depends on how many items do you have ---, you will save a bunch of object creation/destruction times.

To 2. If you have a B-tree of order 100, you can remove up to 100 items by one function call.

To 3. This feature can be applied to almost any of the trees, just implement a RemoveSome() function that removes N items and does a rebalance. For B-trees, it's a bit trickier, but can be done.

Note: I supposed you're a programmer. If you need a complete, tested, off-the-shelf solution, you need another answer.

Denes Tarjan
Yes, I'm a programmer. B-tree would be an overkill, but it is a good idea, anyway.
Luís Guilherme
+3  A: 

Long ago in the pre-STL days I wrote my own B-Tree (BST) algorithm because I had a rather large data set at the time (roughly 700K items in 2 trees that were interdependent). I found that rebalancing after every 100-200 insertions/deletions was the peak performance I could get at the time based on experimentation on 486 and SGI hardware. This number may be different now, or maybe not since it does appear to be an algorithmic optimization limit unless you convert to a parallel model.

In short, you could apply a modification trigger for the rebalancing, and allow for forced rebalancing when you've completed all your modifications.

The improvement was remarkable. The initial straight load was not complete after 25m (killed the process). Rebalancing as we went also was killed after 15m. The restricted modification loads with a rebalance every 100 mods loaded and ran in less than 3m. Note that during the "run" portion, there were 0-8 modification to the tree per initial entry. You really need to consider whether you always need to be in-balance when the tree will be modified again in the near term.

AltGeek
Not a great answer, but a nice side-comment.
Luís Guilherme
+3  A: 

It is possible to delete a range of values a BST in O(logn + objects num).

The easiest way I know is to work with the Deterministic Skip List data structure (you might want to read a bit about this data structure before you go on). In the deterministic skip list all of the real values are stored in the bottom level, and there are pointers on upper levels to them. Insert, search and remove are done in O(logn).

The range deletion operation can be done according to the following algorithm:

  • Find the first element in the range - O(logn)
  • Go forward in the linked list, and remove all elements that are still in the range. If there are elements with pointers to the upper levels - remove them too, until reaching the topmost level (removal from a linked list) - O(number of deleted objects)
  • Fix the pointers to fit deterministic skip list (2-3 elements between every pointer upward)

The total complexity of the range delete is O(logn + number of objects in the range). Notice that if you choose to work with a random skip list, you get the same complexity, but on average, and not worst case. The plus is that you don't have to fix the upper level pointers to meet the 2-3 demand.

A deterministic skip list has a 1-1 mapping to a 2-3 tree, so with some more work, the procedure described above could work for a 2-3 tree as well.

Anna
The C++ STL containers `std::set` and `std::map` offer a method `.erase(begin, end)` for deleting ranges. Per standard, its complexity is O(log(size()) + number of objects in the range).
Jason Orendorff
@Jason, this is very cool. First of all, it's cool because there is no need to implement it yourself :), and second, it means that I can learn something new.In AVL trees, such a procedure is not simple to implement. It is worthwhile to see how red-black trees are implemented, and how it could be done there.
Anna
Thanks @Jason and @Anna, a great answer with a remarkable comment!
Luís Guilherme
+2  A: 

It should be easy to implement deleting a node and its sub nodes in an AVL tree if every node stores its height instead of a balance factor. After deleting a node keep rotating until the two child nodes differ by no more than one. Then move up the tree and repeat. The only real difference from a normal deletion will be a while instead of an if for testing the heights.

gradbot
This is, somewhat, the idea which underlines the scapegoat trees. However, it is also more flexible.
Luís Guilherme
+1  A: 

The Set implementation in the OCaml standard library is a purely functional AVL tree that satisfies all of your requirements and, in particular, has very efficient implementations of set theoretic operations (union, intersection, difference). Insertion and deletion are O(log n). You can remove subtrees and runs of elements by representing them as a set and using set difference. You can insert two elements simultaneously by creating a 2-element set and applying set union.

Jon Harrop