views: 282
answers: 5

I've got a list of 10000 keywords. What is an efficient search algorithm to provide auto-completion with that list?

+2  A: 

I think binary search works just fine for 10000 entries: keep the list sorted, binary-search for the first word that is not less than the typed prefix, then scan forward while the words still start with that prefix.
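A minimal sketch of that approach using Python's bisect module (the sample words are my own):

import bisect

def complete(sorted_words, prefix):
  # Find the first word >= prefix, then take words while the prefix matches.
  start = bisect.bisect_left(sorted_words, prefix)
  matches = []
  for w in sorted_words[start:]:
    if not w.startswith(prefix):
      break
    matches.append(w)
  return matches

words = sorted(["apple", "apply", "banana", "band"])
print(complete(words, "app"))  # ['apple', 'apply']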

Georg
+1  A: 

A trie: http://en.wikipedia.org/wiki/Trie gives you O(L) search time for a prefix of length L, which amounts to constant extra work whenever a letter is typed (I'm assuming you want new suggestions whenever a letter is typed). This should be fairly efficient if you have short words, and the search space shrinks with each new letter.
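A minimal sketch of such a trie with dict-based children (class and method names are my own):

class TrieNode:
  def __init__(self):
    self.children = {}   # letter -> TrieNode; a dict instead of a fixed-size array
    self.is_word = False

class Trie:
  def __init__(self, words=()):
    self.root = TrieNode()
    for w in words:
      self.insert(w)

  def insert(self, word):
    node = self.root
    for ch in word:
      node = node.children.setdefault(ch, TrieNode())
    node.is_word = True

  def complete(self, prefix):
    # Walk down to the prefix node, then collect every word below it.
    node = self.root
    for ch in prefix:
      if ch not in node.children:
        return []
      node = node.children[ch]
    results, stack = [], [(node, prefix)]
    while stack:
      n, path = stack.pop()
      if n.is_word:
        results.append(path)
      for ch, child in n.children.items():
        stack.append((child, path + ch))
    return results

print(sorted(Trie(["apple", "apply", "banana"]).complete("app")))  # ['apple', 'apply']

Using a dict per node also sidesteps the space cost of fixed-size child arrays that a later comment raises.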

IVlad
A: 

As you’ve already mentioned that you store your words in a database (see http://stackoverflow.com/questions/2331919), create an index on those words and let the database do the work. Databases know how to do that efficiently.
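For illustration, a sketch using Python's built-in sqlite3 module (table, column, and sample words are my own invention; note that SQLite only serves LIKE 'prefix%' queries from the index when case_sensitive_like is enabled):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA case_sensitive_like = ON")  # lets the index serve LIKE 'prefix%'
conn.execute("CREATE TABLE keywords (word TEXT)")
conn.executemany("INSERT INTO keywords VALUES (?)",
                 [("apple",), ("apply",), ("banana",)])
conn.execute("CREATE INDEX idx_word ON keywords (word)")

rows = conn.execute("SELECT word FROM keywords WHERE word LIKE ? ORDER BY word",
                    ("app%",)).fetchall()
print([r[0] for r in rows])  # ['apple', 'apply']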

Gumbo
+3  A: 

Using a Trie is an option, but tries are space-inefficient. They can be made more space-efficient by using a modified version known as a Radix Tree or Patricia Tree.

A ternary search tree would probably be a better option. Here is an article on the subject: "Efficient auto-complete with a ternary search tree." Another excellent article on the use of ternary search trees for spelling correction (a similar problem to auto-complete) is "Using Ternary DAGs for spelling correction."
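A minimal sketch of a ternary search tree supporting prefix completion (class and method names are my own):

class TSTNode:
  def __init__(self, char):
    self.char = char
    self.left = self.mid = self.right = None
    self.is_word = False

class TernarySearchTree:
  def __init__(self):
    self.root = None

  def insert(self, word):
    self.root = self._insert(self.root, word, 0)

  def _insert(self, node, word, i):
    ch = word[i]
    if node is None:
      node = TSTNode(ch)
    if ch < node.char:
      node.left = self._insert(node.left, word, i)
    elif ch > node.char:
      node.right = self._insert(node.right, word, i)
    elif i + 1 < len(word):
      node.mid = self._insert(node.mid, word, i + 1)
    else:
      node.is_word = True
    return node

  def complete(self, prefix):
    node = self._find(self.root, prefix, 0)
    if node is None:
      return []
    results = [prefix] if node.is_word else []
    self._collect(node.mid, prefix, results)
    return results

  def _find(self, node, prefix, i):
    # Descend to the node matching the last letter of the prefix.
    if node is None:
      return None
    ch = prefix[i]
    if ch < node.char:
      return self._find(node.left, prefix, i)
    if ch > node.char:
      return self._find(node.right, prefix, i)
    if i + 1 == len(prefix):
      return node
    return self._find(node.mid, prefix, i + 1)

  def _collect(self, node, prefix, results):
    # In-order walk of the subtree, accumulating complete words.
    if node is None:
      return
    self._collect(node.left, prefix, results)
    if node.is_word:
      results.append(prefix + node.char)
    self._collect(node.mid, prefix + node.char, results)
    self._collect(node.right, prefix, results)

tst = TernarySearchTree()
for w in ["apple", "apply", "banana"]:
  tst.insert(w)
print(tst.complete("app"))  # ['apple', 'apply']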

Simucal
Note that instead of having a fixed-size array in each node of the Trie, one could store a dynamically sized structure (a linked list?), extending it for each letter added that is not present.
Matthieu M.
A: 

A rather roundabout method was proposed on SO for crosswords.

It could be adapted here fairly easily :)

The idea is simple, yet quite efficient: it consists of indexing the words, building one index per letter position. It should be noted that after 4/5 letters, the subset of candidate words is so small that a brute-force scan is probably best... this would have to be measured, of course.

As for the idea, here is a Python way:

from collections import defaultdict
from functools import reduce

class AutoCompleter:
  def __init__(self, words):
    self._words = words
    self.map = defaultdict(set)
    self._map()

  def _map(self):
    # Index every word under each (position, letter) pair it contains.
    for w in self._words:
      for i, letter in enumerate(w):
        self.map[(i, letter)].add(w)

  def words(self, firstLetters):
    # Gives all the sets for each letter position
    sets = [self.map[(i, letter)] for i, letter in enumerate(firstLetters)]

    # Order them so that the smallest set is first
    sets.sort(key=len)

    # Intersect all sets, from left to right (smallest to biggest)
    return reduce(lambda x, y: x & y, sets)
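For example, a usage sketch with made-up words:

ac = AutoCompleter(["apple", "apply", "banana", "band"])
print(ac.words("app"))  # {'apple', 'apply'}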

The memory requirement is quite stringent: one entry for each letter at each position. However, an entry only exists when some word actually has that letter at that position, which is not the case for all combinations.

The speed seems quite good too. If you wish to auto-complete on a 3-letter prefix (the classic threshold to trigger auto-completion), the cost is:

  • 3 look-ups in a hash map
  • 2 intersections of sets (definitely the hot spot), but ordered so as to be as efficient as possible.

I would definitely need to try this out against the ternary search tree and trie approaches to see how it fares.

Matthieu M.