tags:

views:

562

answers:

4

I'm implementing a trie for predictive text entry in VB.NET - basically autocompletion as far as the use of the trie is concerned. I've made my trie a recursive data structure based on the generic dictionary class.

It's basically:

class WordTree Inherits Dictionary(of Char, WordTree)

Each letter in a word (all upper cased) is used as a key to a new WordTrie. A null character on a leaf indicates the termination of a word. To find a word starting with a prefix I walk the trie as far as my prefix goes then collect all children words.

My question is basically on the implementation of the trie itself. I'm using the dictionary hash function to branch my tree. I could use a list and do a linear search over the list, or do something else. What's the smooth move here? Is this a reasonable way to do my branching?

Thanks.

Update:

Just to clarify, I'm basically asking if the dictionary branching approach is obviously inferior to some other alternative. The application in which I'm using this data structure only uses upper case letters, so maybe the array approach is the best. I might use the same data structure for a more complex typeahead situation in the future (more characters). In that case, it sounds like the dictionary is the right approach - up to the point where I need to use something more complex in general.

+2  A: 

If it's just the 26 letters, as a 26 entry array. Then lookup is by index. It probably uses less space than the Dictionary if the bucket-list is longer than 26.

Lou Franco
A: 

A good data structure that's efficient in space and potentially gives sub-linear prefix lookups is the ternary search tree. Peter Kankowski has a fantastic article about it. He uses C, but it's straightforward code once you understand the data structure. As he mentioned, this is the structure ispell uses for spelling correction.

Hao Lian
Thanks for the great link!
Adam Bernier
A: 

I have done this (a trie implementation) in C with 8 bit chars, and simply used the array version (as alluded to by the "26 chars" answer).

HOWEVER, I am guessing that you want full unicode support (since a .NET char is unicode, among other reasons). Assuming you have to have support for unicode, the hash/map/dictionary lookup is probably your best bet, as a 64K entry array in each node won't really work very well.

About the only hack up I could think of on this is to store entire strings (suffixes or possibly "in-fixes") on branches that do not yet split, depending on how sparse the tree, er, trie, is. That adds a lot of logic to detect the multi-char strings, though, and to split them up when an alternate path is introduced.

What is the read vs update pattern?

Roboprog
As it stands now, new words are not added to the trie. It's built periodically on a batch basis. I may add functionality to capture new words as they are entered in the future. In that case it would be a more balanced read/update situation.
Steve
A: 

If you are worried about space, you can use bitmap compression on the valid byte transitions, assuming the 26char limit.

class State  // could be struct or whatever
{
    int valid; // can handle 32 transitions -- each bit set is valid
    vector<State> transitions;

    State getNextState( int ch )
    {
         int index;
         int mask  = ( 1 << ( toupper( ch ) - 'A' )) -1;
         int bitsToCount = valid & mask;

         for( index = 0; bitsToCount ; bitsToCount  >>= 1)
         {
             index  += bitsToCount  & 1;
         }  
         transitions.at( index );
    }
};

There are other ways to do the bit counting Here, the index into the vector is the number of set bits in the valid bitset. the other alternative is the direct indexed array of states;

class State
{
    State transitions[ 26 ]; // use the char as the index.

    State getNextState( int ch )
    {
        return transitions[ ch ];
    }
};
sfossen