Randomized binary search trees like the treap give good performance (on the order of O(log n)) with high probability, while avoiding the complicated (and costly) rebalancing operations needed by deterministic balanced trees like AVL, red-black, AA, etc.

We know that if we add random keys to a simple BST, we can expect it to be reasonably balanced. An intuitive reason is that the number of heavily unbalanced trees on n nodes is much smaller than the number of "almost balanced" ones, so a random insertion order for the keys is likely to end up with an acceptable tree.

For this case, in "The Art of Computer Programming", Knuth gives about 1.386*lg(n) (that is, 2*ln(n)) as the average length of a search path, which is rather good. He also says that deleting a random key from a random tree preserves its randomness (and hence its good average balance).

It seems, then, that a binary search tree whose keys are inserted and deleted in random order would most likely give O(log n) performance for all three operations: search, insert and delete.
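
As a sanity check on those figures, here is a minimal sketch (mine, not from Knuth) that builds plain BSTs from random insertion orders and compares the measured average node depth with 1.386*lg(n); the two should agree up to the lower-order terms the asymptotic analysis drops:

    import math
    import random

    def insert(root, key):
        # Plain unbalanced BST; a node is a list [key, left, right].
        if root is None:
            return [key, None, None]
        if key < root[0]:
            root[1] = insert(root[1], key)
        else:
            root[2] = insert(root[2], key)
        return root

    def total_depth(node, depth=1):
        # Sum of the depths of all nodes (the root has depth 1).
        if node is None:
            return 0
        return depth + total_depth(node[1], depth + 1) + total_depth(node[2], depth + 1)

    n, trials = 1024, 50
    avg = 0.0
    for _ in range(trials):
        root = None
        for k in random.sample(range(10 * n), n):   # distinct keys, random order
            root = insert(root, k)
        avg += total_depth(root) / n
    avg /= trials
    print(avg, 1.386 * math.log2(n))   # measured vs. predicted average depth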

That said, I wonder if the following approach would give the same good properties:

  • take a hash function h(x) that is known to be "good" (i.e. it ensures an even spread of the keys);
  • order the tree by h(key) instead of by the keys' natural ordering;
  • in case of a collision, order by the key itself. Collisions should be rare if the hash function is good enough and its range is much bigger than the set of keys (a code sketch follows).
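
A minimal sketch of what I have in mind, in Python. The SHA-256-based h(x) is just a stand-in for a "good" hash so the idea can be tested (Python's built-in hash() is the identity on small integers, which would defeat the purpose); any well-spread hash would do:

    import hashlib

    def h(key):
        # Stand-in "good" hash: the first 64 bits of SHA-256 over repr(key).
        return int.from_bytes(hashlib.sha256(repr(key).encode()).digest()[:8], "big")

    class Node:
        def __init__(self, key):
            self.key = key
            self.rank = (h(key), key)   # order by hash; the key only breaks ties
            self.left = None
            self.right = None

    class HashOrderedBST:
        def __init__(self):
            self.root = None

        def insert(self, key):
            rank = (h(key), key)
            if self.root is None:
                self.root = Node(key)
                return
            node = self.root
            while True:
                if rank < node.rank:
                    if node.left is None:
                        node.left = Node(key)
                        return
                    node = node.left
                elif rank > node.rank:
                    if node.right is None:
                        node.right = Node(key)
                        return
                    node = node.right
                else:
                    return   # key already present

        def contains(self, key):
            rank, node = (h(key), key), self.root
            while node is not None:
                if rank == node.rank:
                    return True
                node = node.left if rank < node.rank else node.right
            return False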

To give an example, a BST for the keys {4, 3, 5, 1, 2}, inserted in that order, would be:

                  4
                 / \
                3   5
               /
              1
               \
                2

Assuming the hash function maps them (respectively) to {221, 142, 12, 380, 18}, we would get:

                    221(4)
                   /      \
               142(3)    380(1)
                  /
               12(5)
                   \
                  18(2)

The key point is that a "regular" BST may degenerate because the keys are inserted according to the same ordering relation that is used to store them in the tree (their "natural" ordering, e.g. alphabetical order for strings). The hash function, instead, induces an ordering on the keys that is completely unrelated to the natural one, and hence should give the same result as inserting the keys in random order.

A strong assumption is that the hash function is "good", but it's not an unreasonable one, I think.
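
Reusing the HashOrderedBST sketch from above, a quick experiment along these lines: insert the keys 1..1024 in their natural (sorted) order, the classic worst case that turns a plain BST into a 1024-deep linked list, and check the height of the hash-ordered tree instead:

    def height(node):
        # Height counted in nodes; an empty tree has height 0.
        if node is None:
            return 0
        return 1 + max(height(node.left), height(node.right))

    t = HashOrderedBST()
    for k in range(1, 1025):   # sorted insertion order
        t.insert(k)
    print(height(t.root))   # expected: on the order of lg(1024), roughly 20-30, not 1024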

I didn't find any reference to a similar approach in the literature so it might be completely wrong, but I can't see why!

Do you see any drawbacks in my reasoning? Has anyone already attempted this?

A: 

Isn't this just one way to store a hash table?

Hogan
+2  A: 

Sounds reasonable to me. Have you searched to see if this has already been formalized or at least noted?

Regarding drawbacks: I suppose one possible objection would be: if you have already paid the price of running a hash function, why not just use a hash table?

A related objection is that you have tied the time complexity to the distribution properties of the hash function, in which case the tree doesn't add very much over a hash table. I like trees, but hash tables are generally faster. So the main advantage of the hashed tree is that it uses the full range of the hash function, whereas the hash table throws much of it away in the modulus operation.
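
To make the modulus point concrete with the hash values from the question's example, assuming a hypothetical 16-bucket table:

    for hv in (221, 142, 12, 380, 18):
        print(hv, hv % 16)   # full hash value vs. the bucket a 16-slot table keeps
    # 12 and 380 land in the same bucket (12) even though they sit at
    # opposite ends of the tree's ordering.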

DigitalRoss
A: 

This is generally pretty similar to how extendible hashing works, though that usually uses something like a B-tree for the storage. And yes, it does generally work quite well.

Jerry Coffin
+3  A: 

I think what you are suggesting is simply ordering by hash values, relying on the spread of the hash values for a balanced tree. This works, and with a good hash function it should give adequately balanced trees in practice.

The reason we don't see other people using something like this, IMO, is that if you order by hash value, your data structure is no longer sorted by key. Yes, it is still sorted by hash value, but the element with the smallest hash value is usually not the element you would need to search for, whereas queries like smallest/largest/k-th element are often useful. Since the data structure would no longer have this property, it makes a lot more sense to use a hash table, which uses the hash function to index into an array and gets O(1) performance instead of O(log n).
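
A tiny illustration of that point with the hash values from the question: an in-order walk of the hash-ordered tree visits nodes by hash value, so the keys come out scrambled and min/max/range/k-th queries over the keys are lost:

    # (hash, key) pairs from the question's example
    pairs = sorted([(221, 4), (142, 3), (12, 5), (380, 1), (18, 2)])
    print([key for _, key in pairs])   # [5, 2, 3, 4, 1] -- not [1, 2, 3, 4, 5]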

MAK
A: 

2 should be a child of 1, not of 3

ashish