views:

344

answers:

5

Ok, so this is something that's always bothered me. The tree data structures I know of are:

  • Unbalanced binary trees
  • AVL trees
  • Red-black trees
  • 2-3 trees
  • B-trees
  • B*-trees
  • Tries
  • Heaps

How do I determine what kind of tree is the best tool for the job? Obviously heaps are canonically used to form priority queues. But the rest of them just seem to be different ways of doing the same thing. Is there any way to choose the best one for the job?

A: 

Each of these has different complexity for insertion, deletion and retrieval, All have mostly O log(n) access times.

rep_movsd
+5  A: 

The same as any other data structure, you have to know the characteristics (complexity of search, insert, and delete operations) of each type of tree, and the requirements of the job you're selecting a tool for. The tree that has the best performance for the type of operations you'll do most often is usually the best tool for the job.

You can usually find the general characteristics for any kind of data structure on Wikipedia. Introduction to Algorithms also has at least a section (in some cases a whole chapter) on most of the data structures you've listed, so it's another good reference.

Bill the Lizard
I was planning on digging through that book eventually. I just find that it's usually easier to get an overall idea of what I should be paying attention to before getting into it. :-)
Jason Baker
@Jason: I do the same thing, checking Wikipedia and SO first, before poring through printed materials.
Bill the Lizard
A: 

Each tree has specific characteristics which make them usefull in a certain way. You should compare there characteristics with the needs you have.

Ikke
+13  A: 

Let’s pick them off one by one, shall we?

  • Unbalanced binary trees

For search tasks, never. Basically, their performance characteristics will be completely unpredictable and the overhead of balancing a tree won’t be so big as to make unbalanced trees a viable alternative.

Apart from that, unbalanced binary trees of course have other uses, but not as search trees.

  • AVL trees

They are easy to develop but their performance is generally surpassed by other balancing strategies because balancing them is comparatively time-intensive. Wikipedia claims that they perform better in lookup-intensive scenarios because their height is slightly less in the worst case.

  • Red-black trees

These are used inside most of C++’ std::map implemenations and probably in a few other standard libraries as well. However, there’s good evidence that they are actually worse than B(+) trees in every scenario due to caching behaviour of modern CPUs. Historically, when caching wasn’t as important (or as good), they surpassed B trees when used in main memory.

  • 2-3 trees
  • B-trees
  • B*-trees

These require the most careful consideration of all the trees, since the different constants used are basically “magical” constans which relate in weird and sometimes unpredictable way to the underlying hardware architecture. For example, the optimal number of child nodes per level can depend on the size of a memory page or cache line.

I know of no good, general rule to distinguish between them.

  • Tries

Completely different. Tries are also search trees, but for text retrieval of substrings in a corpus. A trie is an uncompressed prefix tree (i.e. a tree in which the paths from root to leaf nodes correspond to all the prefixes of a given string).

Tries should be compared to, and offset against, suffix trees, suffix arrays and q-gram indices – not so much against other search trees because the data that they search is different: instead of discrete words in a corpus, the latter index structures allow a factor search.

  • Heaps

As you’ve already said, they are not search trees at all.

Konrad Rudolph
+1 Very nice answer.
Kyle Rozendo
+3  A: 

Similar question: When to choose RB tree, B-Tree or AVL tree?

Offhand, I'd say, write the simplest code that could possibly work (availing yourself of library-provided data structures if possible). Then measure its performance problems, if any.

If your performance needs are really extreme, read Konrad Rudolph's awesome answer. :)

Jason Orendorff