views: 158
answers: 5

Hi all,

Do you have any experience with the structures mentioned above? If insertion and lookup performance matter, which seems best in practice?

If the hash table has lots of collisions, you end up with bucket lists you need to traverse, so lookups can basically degrade to O(n), which hurts if performance matters. I have no experience with radix trees so far, and I've heard that the red-black tree beats the AVL tree regarding lookups.

What is your experience with that?

thanks, Dan

+4  A: 

I don't think I've ever had a situation where I've written code with a hashtable, and the structure has proved to have inherently unacceptable performance.

I don't think I've ever had a situation where I've written code with a balanced tree, and the structure has proved to have inherently unacceptable performance.

So I don't worry about it too much, and take the easy route. Python provides hash-based sets/dictionaries, so I use those. C++03 provides set and map but not unordered_set and unordered_map, so I use those. In languages/libraries where both are available, I use a tree if I need order, and a hashtable if (a) I don't need order, and (b) I have or can write a decent hash function. If I have a natural order already, but not a natural hash function, then of course the easy route is a tree.

In practice, the vast majority of associative arrays are keyed either on strings, integers, or tuples of those. So you always have an order, and a decent hash function isn't far off. If the one I'm using seems at all dodgy, it's therefore easy enough to try them both.

Radix trees are basically for cases where you expect long common prefixes. For example, if your keys are URLs of web pages, then somewhere between "almost all" and "all" of them are going to start with either http:// or https://, with hostnames providing further common prefixes for sections of your tree. An obvious optimisation, therefore, is to avoid comparing that common prefix multiple times during a lookup, which indicates against a "plain" balanced tree. But as always, the "obvious" optimisation isn't necessarily worth the effort, unless you happen to have a library for it already to hand.
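To illustrate the prefix-sharing idea (my sketch, not Steve's; `RadixNode`, `insert`, and `contains` are made-up names): a minimal radix tree where each edge carries a whole substring, so a shared prefix like "http://" is compared once per lookup rather than once per stored key.

```python
# Minimal radix tree sketch: edges store whole substrings, so common
# prefixes of the keys are stored and compared only once.

class RadixNode:
    def __init__(self):
        self.children = {}   # first char of edge label -> (label, child)
        self.is_key = False

def _common_prefix_len(a, b):
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i

def insert(root, key):
    node = root
    while key:
        entry = node.children.get(key[0])
        if entry is None:                 # no edge starts with this char
            leaf = RadixNode()
            leaf.is_key = True
            node.children[key[0]] = (key, leaf)
            return
        label, child = entry
        k = _common_prefix_len(label, key)
        if k == len(label):               # edge fully matched: descend
            node, key = child, key[k:]
        else:                             # split the edge at the mismatch
            mid = RadixNode()
            mid.children[label[k]] = (label[k:], child)
            node.children[key[0]] = (label[:k], mid)
            node, key = mid, key[k:]
    node.is_key = True

def contains(root, key):
    node = root
    while key:
        entry = node.children.get(key[0])
        if entry is None:
            return False
        label, child = entry
        if not key.startswith(label):
            return False
        node, key = child, key[len(label):]
    return node.is_key
```

After inserting "http://a.com", "http://b.com", and "https://a.com", the common "http" ends up on a single shared edge near the root, which is exactly the comparison work a plain balanced tree would repeat on every lookup.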

Steve Jessop
+1  A: 

This is a question that has to be answered "It depends." It depends on a lot of factors. Hash tables can take up more space than a tree-based structure so sometimes you need to sacrifice quick look-up for a smaller memory footprint. Sometimes, you have to have the fast look-up. Sometimes you will be retrieving a list of items from x to y, so an ordered tree-based structure makes more sense. Sometimes you need the nearest neighbors, a la a BK-tree. Sometimes... You get the point, no?

This is what they mean by a good algorithm beats all. Your choice will be impacted by the problem domain, the things you're going to need to do with it, and such.

wheaties
A: 

I think you are missing the point with this question. There is no one-size-fits-all data structure that is best for all possible scenarios. Each structure has its own strengths and weaknesses.

For example, as you already know, hash tables have expected O(1) lookup and insert/remove performance. The downside is that this does not take into account the hidden constant of actually computing the hash of the keys, which can be quite expensive (e.g. when hashing large strings). Collisions should be exceedingly rare if you use a good hash function, but even small numbers of collisions add up and affect performance.
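To make the collision point concrete, here is a small sketch of my own (not part of the answer): a chained hash table that counts key comparisons. With a degenerate hash function every key lands in one bucket, and the expected O(1) lookup becomes a linear scan.

```python
# Chained hash table that reports how many key comparisons a lookup
# needs, to show how collisions degrade O(1) toward O(n).

class ChainedTable:
    def __init__(self, nbuckets=64, hash_fn=hash):
        self.buckets = [[] for _ in range(nbuckets)]
        self.hash_fn = hash_fn

    def insert(self, key, value):
        bucket = self.buckets[self.hash_fn(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # overwrite existing key
                return
        bucket.append((key, value))

    def probes(self, key):
        """Key comparisons needed to find `key` in its bucket."""
        bucket = self.buckets[self.hash_fn(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                return i + 1
        return len(bucket)                 # not found: scanned whole chain

good = ChainedTable()                      # Python's built-in hash
bad = ChainedTable(hash_fn=lambda k: 0)    # pathological: all keys collide
for i in range(1000):
    good.insert(i, i)
    bad.insert(i, i)
# good: ~16 entries per bucket; bad: one chain of 1000, so a lookup of
# the last-inserted key walks all 1000 entries.
```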

On the other hand, balanced binary search trees (red-black and AVL) support O(log n) operations, which is indeed slower, but have the added advantage of keeping the elements sorted - thus enabling many more operations (e.g. min, max, k-th element, upper bound, lower bound, range queries, etc.) that are simply not possible in an unordered structure such as a hash table. In practice, the difference in speed between the two is often negligible, even for operations involving millions of elements. The speed difference between red-black and AVL trees is even less noticeable, both being O(log n) and differing only by a small constant factor.
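Those order-based queries can be sketched with the standard library's `bisect` module over a sorted list (a balanced BST answers the same queries in O(log n) while also supporting cheap updates); the data here is made up purely for illustration:

```python
# Order-based queries that a sorted structure supports but a hash table
# cannot answer directly.
import bisect

keys = sorted([23, 4, 42, 8, 15, 16])            # [4, 8, 15, 16, 23, 42]

smallest, largest = keys[0], keys[-1]            # min / max
third = keys[2]                                  # k-th element (k = 3)
lo = bisect.bisect_left(keys, 10)                # lower bound for 10
hi = bisect.bisect_right(keys, 23)               # upper bound for 23
in_range = keys[lo:hi]                           # range query [10, 23]
```

With an unordered hash table, each of these would require scanning every element.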

Then there are lots of other data structures like segment trees, binary indexed trees (AKA Fenwick trees), and the radix trees you mentioned. Please read a good book on data structures (CLRS is recommended) or at least read the Wikipedia article on each, understand how they work and the cost of operations on them, and decide which one suits your job.

The bottom line is that which data structure you choose depends on what you need it for. There is no one structure that is the absolute best and would satisfy every scenario you could possibly come up with.

MAK
Splay trees are always fun, since to estimate their performance you have to model the distribution of keys involved in lookups. This generally provides the kick up the backside necessary to actually run a test instead of guessing ;-)
Steve Jessop
A: 

Between R-B and AVL trees, you need to look at the frequency of searches vs. insertions/deletions. An AVL tree is more strictly balanced, so it does searches somewhat faster, at the expense of insertion/deletion being somewhat slower. Realistically, the difference will usually be negligible unless one is a lot more frequent than the other.

It's almost impossible to comment on hash tables. At its most basic, a hash table has a fixed size. For general-purpose use, however, that's generally considered unacceptable. To remain usable across a wide range of sizes, most hash tables do some sort of re-hashing or incremental resizing. Exactly how that's done typically has a substantial effect on speed (e.g., in my experience, a simple, fixed-size hash table is typically around an order of magnitude faster within its limits).
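A minimal sketch of my own (using the common load-factor scheme, not anything from the answer) of the resizing described above: when the table gets too full, double the bucket count and rehash every entry, so the table stays usable across a wide range of sizes.

```python
# Chained hash table that doubles and rehashes when the load factor
# exceeds 0.75 - the simplest form of the resizing most general-purpose
# tables do, at the cost of an occasional O(n) rehash pass.

class ResizingTable:
    def __init__(self):
        self.buckets = [[] for _ in range(8)]
        self.count = 0

    def insert(self, key, value):
        if self.count / len(self.buckets) > 0.75:
            self._grow()
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.count += 1

    def _grow(self):
        """Double the bucket array and rehash every stored entry."""
        old = self.buckets
        self.buckets = [[] for _ in range(2 * len(old))]
        for bucket in old:
            for key, value in bucket:
                self.buckets[hash(key) % len(self.buckets)].append((key, value))

    def get(self, key):
        for k, v in self.buckets[hash(key) % len(self.buckets)]:
            if k == key:
                return v
        raise KeyError(key)

t = ResizingTable()
for i in range(100):
    t.insert(str(i), i)   # grows from 8 buckets as the table fills
```

A fixed-size table skips `_grow` entirely, which is where its speed advantage within its limits comes from.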

A great deal also depends on the nature of the keys you're working with. The comparison for a tree can look at only part of a key, making a decision as soon as a mismatch is found between the two keys. A decent hash, by comparison, is always computed over the entire key. If you have a small collection with large keys (e.g., long strings) you can easily finish a tree search in less time than it takes to compute the hash. OTOH, if you have a compound key in which the primary fields are often equal, comparisons can slow down in a hurry.
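The short-circuiting point can be demonstrated without any timing (my sketch, with made-up names): count how many characters an ordering comparison must examine before it can decide, versus a hash that always reads the whole key.

```python
# Ordering comparisons stop at the first mismatching character; a hash
# must read every character of the key regardless.

prefix = "a" * 10_000
k1 = "x" + prefix          # keys differing at position 0
k2 = "y" + prefix

def compare_steps(a, b):
    """Characters examined before an ordering comparison can decide."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i + 1
    return min(len(a), len(b))

# One character settles k1 < k2, while hash(k1) reads all 10,001.
# Compound keys with equal leading fields show the opposite case:
# the comparison must wade through the whole shared prefix first.
```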

Jerry Coffin
A: 

I did quite a lot of experiments on this last year and found, to my surprise, that it was very hard to beat 2-3-4 trees (AVL trees are essentially 2-3 trees). Every data structure had its sweet spot, but you had to have quite a lot of data in your collection before you started outperforming those 2-3-4 trees. And the implementation of these 2-3-4 trees was non-destructive (persistent), to boot.

That said, for large amounts of data, use a hash table.

Rafe