views:

114

answers:

3

I am curious to know what is the reasoning that could overweighs towards using a self-balancing tree technique to store items than using a hash table.

I see that hash tables cannot maintain the insertion-order, but I could always use a linked list on top to store the insertion-order sequence.

I see that for small number of values, there is an added cost of of the hash-function, but I could always save the hash-function together with the key for faster lookups.

I understand that hash tables are difficult to implement than the straight-forward implementation of a red-black tree, but in a practical implementation wouldn't one be willing to go an extra mile for the trouble?

I see that with hash tables it is normal for collisions to occur, but with open-addressing techniques like double hashing that allow to save the keys in the hash table itself, hasn't the problem been reduced to the effect of not tipping the favor towards red black trees for such implementations?

I am curious if I am strictly missing a disadvantage of hash table that still makes red black trees quite viable data structure in practical applications (like filesystems et al.)

A: 

In my humble opinion, self-balancing trees work pretty well as Academic topics. And I do not know anything that can be qualified as a "straight-forward implementation of a red-black tree".

In the real world, the memory wall makes them far less efficient than they are on paper.

With this in mind, hash tables are decent alternatives, especially if you don't practice them the Academic style (forget about the table size constraint and you magically resolve the table resize issue and almost all collision issues).

In a word: keep it simple. If that's simple for you then that's simple for your computer.

Pierre
Could you explain how you can "forget about the table size contraint?" Are you just suggesting we ignore it, or do you mean something else?
Odrade
A: 

Here is what I can think of:

  1. There are kinds of data which cannot be hashed (or is too expensive to hash), therefore cannot be stored in hash tables.
  2. Trees keep data in the order you need (sorted), not insertion order. You can't (effectively) do that with hash table, even if you run a linked list through it.
  3. Trees have better worst-case performace
unbeli
1. If you can't produce a hash key, how do you produce a key to determine where the node is placed in the tree? If you can produce a key for the node placement, why can't you use that for the hash key? 2. Why can't you effectively do this with a hash table + linked list? Can you provide more explanation? Remember that a linked list is essentially just a tree optimized for ordering. 3. Trees have a best case of log(N). Hashing is always constant. Collisions have the same effect on both structures. How can trees have better worst-case performance?
Jake
@Jake 1- by comparing elements. Having order on something doesn't mean you can hash something. 2- Because. 3- There is nothing to collide in trees.
unbeli
@unbeli 1. The item you're comparing is the key. All binary data can be hashed by some method, and all data in a computer is binary data. I suppose it could get expensive if the keys are huge, but that would likely have the same effect on the comparison function of the tree. 3. Search trees have ordered nodes. You can only order a list if the list has a set of keys in some form on which to order the elements. If two nodes have the same key, that is a collision.
Jake
1- not true. 3- not true about the keys: you don't need any keys for the search tree. If you insert an entry, which is equal to an existing entry, there is no performance impact, contrary to the hash collision. It is your homework to understand why.
unbeli
+1  A: 

Storage allocation is another consideration. Every time you fill all of the buckets in a hash-table, you need to allocate new storage and re-hash everything. This can be avoided if you know the size of the data ahead of time. On the other hand, balanced trees don't suffer from this issue at all.

Odrade
but the amortized cost of the operation still remains O(1), so do you really think it's a disadvantage?
Vaibhav Bajpai
In most cases, probably not. In real-time applications, yes.
Odrade