tags:

views:

223

answers:

3
+2  Q: 

Trie vs B+ tree

How does Trie and B+ tree compare for indexing lexicographically sorted strings [on the order some billions]? It should support range queries as well.

From perf. as well as implementation complexity point of view.

A: 

Wikipedia has some algorithmic complexity facts: B+ tree (section Characteristics), Trie (unfortunately spread all over the article). Hope that helps.

thSoft
A: 

Depends on your actual task.

If you want to get all subtree then B+Tree is your choice because of it is space efficient. But if you want to get first N children from substree, that Trie is the best choice because of you simply visit less nodes than in B+ Tree scenario. The most popular task which is handled good by Trie is a word prefix completion.

dotsid
+2  A: 

I would say it depends on what you mean by Range.

If your range is expressed as All words beginning by, then a Trie is the right choice I'd say. On the other hand, Trie are not meant for requests like All words between XX and ZZ.

Note that the branching factor of the B+ Tree affects its performance (the number of intermediary nodes). If h is the height of the tree, then nmax ~~ bh. Therefore h ~~ log(nmax) / log(b).

With n = 1 000 000 000 and b = 100, we have h ~~ 5. Therefore it means only 5 pointer dereferencing for going from the root to the leaf. It's more cache-friendly than a Trie.

Finally, B+ Tree is admittedly more difficult to implement than a Trie: it's more on a Red-Black Tree level of complexity.

Matthieu M.
If you are smart about your Trie implementation than "all words between xx and zz" isn't that difficult. If you are storing the edges in lexicographical order then the strings are in lexicographical order as well.
Niki Yoshiuchi
It's a bit more difficult though to exploit the range. In a `B+ Tree` a range can be defined by two pointers (begin/end) and you can iterate through them like in a deque. In a `Trie` you have to implement iteration (from a random pointer to another) in order to be able to do the same, it's less natural, though of course not infeasible and I am afraid less efficient. Or you can just copy the range in another structure, but that could be costly.
Matthieu M.