I'm struggling with the concept of when to use binary search trees and when to use dictionaries.

In my application I did a little experiment which used the C5 library TreeDictionary (which I believe is a red-black binary search tree), and the C# dictionary. The dictionary was always faster at add/find operations and also always used less memory space. For example, at 16809 <int, float> entries, the dictionary used 342 KiB whilst the tree used 723 KiB.

I thought that BSTs were supposed to be more memory efficient, but it seems that one node of the tree requires more bytes than one entry in a dictionary. What gives? Is there a point at which BSTs are better than dictionaries?

Also, as a side question, does anyone know if there is a faster + more memory efficient data structure for storing <int, float> pairs for dictionary type access than either of the mentioned structures?

+1  A: 

It seems to me you're doing a premature optimization.

What I'd suggest to you is to create an interface to isolate which structure you're actually using, and then implement the interface using the Dictionary (which seems to work best).

If memory/performance becomes an issue (which it probably will not for fewer than 20k entries), then you can create other implementations of the interface and check which one works best. You will need to change almost nothing in the rest of the code (except which implementation you're using).
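
As a rough sketch of that idea (the names IValueStore, DictionaryStore and TreeStore are made up for illustration, and the BCL SortedDictionary stands in for the C5 TreeDictionary so the example has no external dependencies):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical interface: the rest of the application depends only on this.
public interface IValueStore
{
    void Add(int key, float value);
    bool TryGet(int key, out float value);
}

// Implementation backed by the built-in hash-based Dictionary.
public sealed class DictionaryStore : IValueStore
{
    private readonly Dictionary<int, float> _items = new Dictionary<int, float>();
    public void Add(int key, float value) => _items.Add(key, value);
    public bool TryGet(int key, out float value) => _items.TryGetValue(key, out value);
}

// Alternative backed by the tree-based SortedDictionary
// (an assumption standing in for C5's TreeDictionary).
public sealed class TreeStore : IValueStore
{
    private readonly SortedDictionary<int, float> _items = new SortedDictionary<int, float>();
    public void Add(int key, float value) => _items.Add(key, value);
    public bool TryGet(int key, out float value) => _items.TryGetValue(key, out value);
}

public static class StoreDemo
{
    public static void Main()
    {
        // Swapping implementations means changing only this line.
        IValueStore store = new DictionaryStore();
        store.Add(42, 1.5f);
        Console.WriteLine(store.TryGet(42, out float v) ? v.ToString() : "missing");
    }
}
```

Both classes can then be run through the same benchmark, since the calling code only ever sees IValueStore.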

Samuel Carrijo
A: 

It does make sense that a tree node would require more storage than a dictionary entry. A binary tree node needs to store the value and both the left and right subtrees. The generic Dictionary<TKey, TValue> is implemented as a hash table which - I'm assuming - either uses a linked list for each bucket (value plus one pointer/reference) or some sort of remapping (just the value). I'd have to have a peek in Reflector to be sure, but for the purpose of this question I don't think it's that important.

The sparser the hash table, the less efficient in terms of storage/memory. If you create a hash table (dictionary) and initialize its capacity to 1 million, and only fill it with 10,000 elements, then I'm pretty sure it would eat up a lot more memory than a BST with 10,000 nodes.

Still, I wouldn't worry about any of this if the number of nodes/keys is only in the thousands. That's going to be measured in kilobytes, compared to gigabytes of physical RAM.


If the question is "why would you want to use a binary tree instead of a hash table?", then the best answer IMO is that binary trees are ordered whereas hash tables are not. You can only search a hash table for keys that are exactly equal to something; with a tree, you can search for a range of values, the nearest value, etc. This is a pretty important distinction if you're creating an index or something similar.
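
To make the ordered-query point concrete, here is a small sketch using the BCL's tree-based SortedSet<int> (chosen here only so the example is self-contained; the helper names Range and Ceiling are invented for illustration). A plain Dictionary has no equivalent of these queries short of scanning every key.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OrderedQueries
{
    // All keys k with low <= k <= high, in ascending order.
    // Note: GetViewBetween throws if low > high.
    public static List<int> Range(SortedSet<int> keys, int low, int high)
        => keys.GetViewBetween(low, high).ToList();

    // Smallest key >= target, or null if there is none.
    public static int? Ceiling(SortedSet<int> keys, int target)
    {
        SortedSet<int> tail = keys.GetViewBetween(target, int.MaxValue);
        return tail.Count > 0 ? tail.Min : (int?)null;
    }

    public static void Main()
    {
        var keys = new SortedSet<int> { 5, 10, 20, 40, 80 };
        Console.WriteLine(string.Join(", ", Range(keys, 8, 40))); // 10, 20, 40
        Console.WriteLine(Ceiling(keys, 21));                      // 40
    }
}
```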

Aaronaught
But the C# dictionary is a hashtable that automatically adjusts its size right? So by not prespecifying its size it will eventually allocate a little more than 10,000 buckets and will probably still use less memory than a tree with exactly 10,000 nodes with faster access times. Unless increasing the size of the dictionary is very slow for a large amount of elements I still don't see the advantage of trees over dictionaries.
Projectile Fish
@Projectile Fish: Generally, when you plan to populate a large dictionary, you initialize it with a specific capacity so that you don't incur the performance penalty associated with auto-growing (this is the same with almost all the generic collections). As long as your capacity estimate isn't way off, then yes, it will still likely be more memory-efficient than a tree, especially with large data sets.
Aaronaught
@Projectile Fish: I also added in a few lines to answer your second question, namely what would be the advantage of a tree over a dictionary.
Aaronaught
That makes sense. I just did a little test and it seems the auto-growing cost isn't so large. An auto-growing dictionary is still faster by 10x and more memory efficient by 2x than a TreeDictionary. So I guess the only reason to use a BST would be if I needed ordered data.
Projectile Fish
A: 

The interface for a tree and a hash table (which I'm guessing is what your Dictionary is based on) should be very similar, always revolving around keyed lookups.

I had always thought a Dictionary was better for creating things once and then doing lots of lookups on it, while a tree was better if you were modifying it significantly. However, I don't know where I picked that idea up from.

(Functional languages often use trees as the basis for their collections, as you can reuse most of the tree if you make small modifications to it.)

A: 

You're not comparing "apples with apples": a BST will give you an ordered representation, while a dictionary allows you to do a lookup on a key/value pair (in your case <int, float>).

I wouldn't expect much difference in the memory footprint between the two, but the dictionary will give you a much faster lookup. To find an item in a BST you (potentially, if the tree is badly unbalanced) need to traverse the entire tree. But to do a dictionary lookup you simply look up based on the key.

nixon
But what is involved in "simply looking up based on the key"? With a BST, if it is relatively balanced, a lookup will be pretty quick, O(log(n)) I think?
Snarfblam
A lookup on a hashtable would be closer to O(1), wouldn't it? Dependent on the implementation, space, etc., but it would definitely be quicker than a BST.
nixon
+2  A: 

I thought that BSTs were supposed to be more memory efficient, but it seems that one node of the tree requires more bytes than one entry in a dictionary. What gives? Is there a point at which BSTs are better than dictionaries?

I've personally never heard of such a principle. Even so, it's only a general principle, not a categorical fact etched in the fabric of the universe.

Generally, Dictionaries are really just a fancy wrapper around an array of linked lists. You insert into the dictionary something like:

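// Pick the bucket for this key; mask the sign bit so the index is never negative.
int index = (key.GetHashCode() & 0x7FFFFFFF) % internalArray.Length;
LinkedList<Tuple<TKey, TValue>> list = internalArray[index];
// Any() requires "using System.Linq;".
if (list.Any(x => x.Item1.Equals(key)))
    throw new ArgumentException("Key already exists");
list.AddLast(Tuple.Create(key, value));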

So it's a nearly O(1) operation. The dictionary uses O(internalArray.Length + n) memory, where n is the number of items in the collection.
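
The "array of linked lists" picture can be fleshed out into a small runnable class. This is only a sketch of separate chaining under simplifying assumptions (fixed bucket count, no resizing, non-null keys, bucketCount > 0); the name ChainedTable is made up, and it is not a claim about how the real Dictionary<TKey, TValue> lays out its memory.

```csharp
using System;
using System.Collections.Generic;

// Simplified separate-chaining hash table: an array of buckets,
// each bucket a linked list of key/value pairs.
public sealed class ChainedTable<TKey, TValue>
{
    private readonly LinkedList<KeyValuePair<TKey, TValue>>[] _buckets;

    public ChainedTable(int bucketCount)
    {
        _buckets = new LinkedList<KeyValuePair<TKey, TValue>>[bucketCount];
    }

    // Mask the sign bit so the bucket index is never negative.
    private int IndexFor(TKey key) => (key.GetHashCode() & 0x7FFFFFFF) % _buckets.Length;

    public void Add(TKey key, TValue value)
    {
        int i = IndexFor(key);
        var list = _buckets[i] ?? (_buckets[i] = new LinkedList<KeyValuePair<TKey, TValue>>());
        foreach (var pair in list)
            if (EqualityComparer<TKey>.Default.Equals(pair.Key, key))
                throw new ArgumentException("Key already exists");
        list.AddLast(new KeyValuePair<TKey, TValue>(key, value));
    }

    public bool TryGetValue(TKey key, out TValue value)
    {
        var list = _buckets[IndexFor(key)];
        if (list != null)
        {
            // Only the one bucket selected by the hash is scanned.
            foreach (var pair in list)
            {
                if (EqualityComparer<TKey>.Default.Equals(pair.Key, key))
                {
                    value = pair.Value;
                    return true;
                }
            }
        }
        value = default(TValue);
        return false;
    }
}

public static class ChainedTableDemo
{
    public static void Main()
    {
        var table = new ChainedTable<int, float>(16);
        table.Add(3, 0.25f);
        table.Add(19, 0.75f); // 19 % 16 == 3, so this shares a bucket with key 3
        Console.WriteLine(table.TryGetValue(19, out float v) ? v.ToString() : "missing");
    }
}
```

Lookups stay close to O(1) only while the chains stay short, which is why real implementations grow the bucket array as items are added.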

In general BSTs can be implemented as:

  • linked-lists, which use O(n) space, where n is the number of items in the collection.
  • arrays, which use O(2^h - n) space, where h is the height of the tree and n is the number of items in the collection.
    • Since a red-black tree's height is bounded by 2 * log2(n + 1), an array implementation has a worst-case memory usage of about O(2^(2 * log2(n + 1)) - n), which is O(n^2).

Odds are, the C5 TreeDictionary is implemented using arrays, which is probably responsible for the wasted space.

What gives? Is there a point at which BSTs are better than dictionaries?

Dictionaries have some undesirable properties:

  • There may not be enough contiguous blocks of memory to hold your dictionary, even if its memory requirements are much less than the total available RAM.

  • Evaluating the hash function can take an arbitrarily long time. Take strings, for example: use Reflector to examine the System.String.GetHashCode method and you'll notice that hashing a string always takes O(n) time, which means it can take considerable time for very long strings. On the other hand, comparing strings for inequality is almost always faster than hashing, since it may require looking at just the first few chars. It's entirely possible for tree inserts to be faster than dictionary inserts if hash code evaluation takes too long.

    • Int32's GetHashCode method is literally just return this, so you'd be hard-pressed to find a case where a hashtable with int keys is slower than a tree dictionary.

RB Trees have some desirable properties:

  • You can find/remove the Min and Max elements in O(log n) time, compared to O(n) time using a dictionary.

  • If a tree is implemented as a linked structure rather than an array, the tree is usually more space efficient than a dictionary.

  • Likewise, it's ridiculously easy to write immutable versions of trees which support insert/lookup/delete in O(log n) time. Dictionaries do not adapt well to immutability, since you need to copy the entire internal array for every operation (actually, I have seen some array-based implementations of immutable finger trees, a kind of general purpose dictionary data structure, but the implementation is very complex).

  • You can traverse all the elements in a tree in sorted order in constant space and O(n) time, whereas you'd need to dump a hash table into an array and sort it to get the same effect.
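
To illustrate the immutability and sorted-traversal points, here is a minimal sketch of a persistent BST insert using path copying. The Node and PersistentTree names are invented for this example, and balancing (the red-black part) is deliberately left out to keep it short, so the O(log n) bound only holds while the tree happens to stay balanced.

```csharp
using System;
using System.Collections.Generic;

// Immutable tree node: once constructed, it never changes.
public sealed class Node
{
    public readonly int Key;
    public readonly float Value;
    public readonly Node Left, Right;

    public Node(int key, float value, Node left, Node right)
    {
        Key = key; Value = value; Left = left; Right = right;
    }
}

public static class PersistentTree
{
    // Returns a new root. Only the nodes on the search path are copied;
    // every untouched subtree is shared with the old version.
    public static Node Insert(Node root, int key, float value)
    {
        if (root == null) return new Node(key, value, null, null);
        if (key < root.Key)
            return new Node(root.Key, root.Value, Insert(root.Left, key, value), root.Right);
        if (key > root.Key)
            return new Node(root.Key, root.Value, root.Left, Insert(root.Right, key, value));
        return new Node(key, value, root.Left, root.Right); // same key: replace the value
    }

    // In-order traversal yields keys in ascending order.
    // (This recursive version uses stack space proportional to the tree height.)
    public static IEnumerable<int> KeysInOrder(Node root)
    {
        if (root == null) yield break;
        foreach (int k in KeysInOrder(root.Left)) yield return k;
        yield return root.Key;
        foreach (int k in KeysInOrder(root.Right)) yield return k;
    }

    public static void Main()
    {
        Node v1 = Insert(Insert(Insert(null, 50, 0.5f), 30, 0.3f), 70, 0.7f);
        Node v2 = Insert(v1, 60, 0.6f); // v1 is still valid and unchanged

        Console.WriteLine(string.Join(", ", KeysInOrder(v1))); // 30, 50, 70
        Console.WriteLine(string.Join(", ", KeysInOrder(v2))); // 30, 50, 60, 70
        Console.WriteLine(ReferenceEquals(v1.Left, v2.Left));  // True: left subtree is shared
    }
}
```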

So, the choice of data structure really depends on what properties you need. If you just want an unordered bag and can guarantee that your hash function evaluates quickly, go with a .Net Dictionary. If you need an ordered bag or have a slow-running hash function, go with TreeDictionary.

Juliet