views:

415

answers:

3

One of my favourite data structures in college was the Trie. It's a great data structure for holding a large set of strings if the prefixes are shared. Lookups are also nice, since they are done at O(|length|) of the string regardless of how many strings are in the set.

By comparison, a balanced tree would be O(log N) in the number of set items, plus whatever you pay for comparisons. A hash table would involve the hash calculation, comparison, etc.

It is therefore surprising to me that there is no Trie implementation in the standard library of most languages.

The only reason I could come up with was the possibility that memory access costs make it too expensive. Rather than investigating O(log N) locations if doing a tree lookup, you are not doing O(|length|) different locations, with all the consequences. If the strings are long, this could end up being way too much.

So I'm wondering: how much is what I just described a concern? What do you do when you need to store a large set or map of strings?

+4  A: 

You could just build two sample apps and see which one performs better. Memory access is cheap assuming you don't page fault. Then it's very expensive. For client application development, its almost always better to process than to access memory for this very reason. Modern processors are ridiculously fast, but cache misses still hurt.

jeffamaphone
I used to work at Intel for several years, so I'm extremely paranoid even about falling out of the same cache line. Besides, if each node is located elsewhere on the heap, unless my garbage collector is reorganizing stuff I may very well page fault.
Uri
Good luck writing any search algorithm that stays in the same cache line!
Andrew Grant
I've actually once did that with fixed graphs for fun (think subway maps) and got an impressive speedup... :)
Uri
+6  A: 

I hadn't thought of this as an area of concern before, but now that you mention it, there are times when a standard Trie implementation might be handy. On the other hand, as far as I know, Tries are used by Python and Perl and other string-savvy languages that I use now.

Last I checked, which was ages ago, the BSD kernel code used Tries (Patricia Tries) in the code to select the best interface for sending packets. Looks like Wikipedia has some info.

Craig S
+1  A: 

I did some performance testing in C# with a Trie and a Dictionary (a strongly typed hash table). I found that the Dictionary was 5-10 times faster than the Trie. Perhaps my implementation of the Trie could be optimized a bit, but hardly enough to be much faster than (or perhaps even as fast as) the Dictionary.

The ContainsKey method in the dictionary is close to an O(1) operation (depending on how good the hashing algorithm is), so it's not easy to make a collection that beats that as long as the hashing algorithm is reasonably fast.

With a custom IEqualityComparer you can use most anything as a key in a Dictionary, which makes it rather flexible. A Trie is a bit more limited in what you can use as key, so that limits the usefulness somewhat.

Guffa
Of course the hash has faster access. The advantage of a trie over a hash is memory efficiency.