I've got a list of 10000 keywords. What is an efficient search algorithm to provide auto-completion with that list?
A trie (http://en.wikipedia.org/wiki/Trie) gives you a lookup proportional to the length of the prefix typed so far (I'm assuming you want new suggestions whenever a letter is typed). This should be fairly efficient if you have short words, and the search space shrinks with each new letter.
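A minimal sketch of that idea in Python (the class and method names here are illustrative, not a fixed API): walk down the trie along the prefix, then collect every word below the node you land on.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # letter -> TrieNode
        self.is_word = False

class Trie:
    def __init__(self, words):
        self.root = TrieNode()
        for w in words:
            node = self.root
            for ch in w:
                node = node.children.setdefault(ch, TrieNode())
            node.is_word = True

    def complete(self, prefix):
        # Walk down to the node matching the prefix: O(len(prefix))
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        # Collect every word stored below that node
        results, stack = [], [(node, prefix)]
        while stack:
            node, word = stack.pop()
            if node.is_word:
                results.append(word)
            for ch, child in node.children.items():
                stack.append((child, word + ch))
        return results
```

Collecting all completions is proportional to the size of the subtree, so in practice you would cap the number of suggestions returned.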
As you've already mentioned that you store your words in a database (see http://stackoverflow.com/questions/2331919), create an index on those words and let the database do the work. Databases know how to do that efficiently.
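As a sketch of the database route, assuming a simple `keywords(word)` table (the table and column names are made up here), an index plus a left-anchored LIKE handles prefix queries; whether LIKE can actually use the index depends on the database and its collation settings:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE keywords (word TEXT)")
conn.executemany("INSERT INTO keywords VALUES (?)",
                 [("car",), ("cat",), ("card",), ("dog",)])
# An index on word lets the database answer prefix queries efficiently
conn.execute("CREATE INDEX idx_keywords_word ON keywords (word)")

def complete(prefix, limit=10):
    # A left-anchored LIKE ('pre%') is a prefix scan the index can serve
    rows = conn.execute(
        "SELECT word FROM keywords WHERE word LIKE ? ORDER BY word LIMIT ?",
        (prefix + "%", limit))
    return [r[0] for r in rows]
```

With 10000 keywords even a full scan is fast, but the index keeps latency flat as the list grows.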
Using a Trie is an option, but tries are space-inefficient. They can be made more space-efficient by using a modified version known as a Radix Tree, or Patricia Tree.
A ternary search tree would probably be a better option. Here is an article on the subject: "Efficient auto-complete with a ternary search tree." Another excellent article on the use of Ternary Search Trees for spelling correction (a similar problem to auto-complete) is "Using Ternary DAGs for spelling correction."
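A minimal sketch of a ternary search tree in Python (names are illustrative; it assumes non-empty words and prefixes): each node holds one character with left/mid/right links, and completion finds the prefix's node, then does an in-order walk below it.

```python
class TSTNode:
    def __init__(self, ch):
        self.ch = ch
        self.left = self.mid = self.right = None
        self.is_word = False

class TernarySearchTree:
    def __init__(self):
        self.root = None

    def insert(self, word):
        self.root = self._insert(self.root, word, 0)

    def _insert(self, node, word, i):
        ch = word[i]
        if node is None:
            node = TSTNode(ch)
        if ch < node.ch:
            node.left = self._insert(node.left, word, i)
        elif ch > node.ch:
            node.right = self._insert(node.right, word, i)
        elif i + 1 < len(word):
            node.mid = self._insert(node.mid, word, i + 1)
        else:
            node.is_word = True
        return node

    def complete(self, prefix):
        if not prefix:
            return []
        node = self._find(self.root, prefix, 0)
        if node is None:
            return []
        results = [prefix] if node.is_word else []
        self._collect(node.mid, prefix, results)
        return results

    def _find(self, node, prefix, i):
        if node is None:
            return None
        ch = prefix[i]
        if ch < node.ch:
            return self._find(node.left, prefix, i)
        if ch > node.ch:
            return self._find(node.right, prefix, i)
        if i + 1 == len(prefix):
            return node
        return self._find(node.mid, prefix, i + 1)

    def _collect(self, node, prefix, results):
        # In-order traversal, so completions come out in sorted order
        if node is None:
            return
        self._collect(node.left, prefix, results)
        if node.is_word:
            results.append(prefix + node.ch)
        self._collect(node.mid, prefix + node.ch, results)
        self._collect(node.right, prefix, results)
```

Compared to a trie, each node stores a single character and three pointers instead of a full child map, which is where the space savings come from.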
A rather roundabout method was proposed on SO for crosswords.
It could be adapted here fairly easily :)
The idea is simple, yet quite efficient: it consists of indexing the words, building one index per letter position. It should be noted that after 4 or 5 letters, the subset of words remaining is so small that brute force is probably best... this would have to be measured, of course.
As for the idea, here is a Python way:
    from collections import defaultdict
    from functools import reduce

    class AutoCompleter:
        def __init__(self, words):
            self.words = words
            self.map = defaultdict(set)
            self._map()

        def _map(self):
            for w in self.words:
                for i, letter in enumerate(w):
                    self.map[(i, letter)].add(w)

        def complete(self, first_letters):
            # Gives the candidate set for each (position, letter) pair
            sets = [self.map[(i, letter)] for i, letter in enumerate(first_letters)]
            # Order them so that the smallest set is first
            sets.sort(key=len)
            # Intersect all sets, from left to right (smallest to biggest)
            return reduce(lambda x, y: x & y, sets)
The memory requirement is quite stringent: one entry per (position, letter) pair. However, an entry only exists when at least one word has that letter at that position, which is far from all combinations.
The speed seems quite good too, if you wish to auto-complete a 3-letter word (a classic threshold to trigger auto-completion):
- 3 lookups in a hash map
- 2 intersections of sets (definitely the costly part), but ordered so as to be as efficient as possible.
I would definitely need to try this out against the ternary tree and trie approach to see how it fares.