I'm writing an application that needs to read a list of strings from a file, save them in a data structure, and then look up those strings by prefixes. The list of strings is simply a list of words in a given language. For example, if the search function gets "stup" as a parameter, it should return ["stupid", "stupidity", "stupor"...]. It should do so in O(log(n)*m) time, where n is the size of the data structure and m is the number of results and should be as fast as possible. Memory consumption is not a big issue right now. I'm writing this in python, so it would be great if you could point me to a suitable data structure (preferably) implemented in c with python wrappers.
views:
405answers:
4You want a trie.
http://en.wikipedia.org/wiki/Trie
I've used them in Scrabble and Boggle programs. They're perfect for the use case you described (fast prefix lookup).
Here's some sample code for building up a trie in Python. This is from a Boggle program I whipped together a few months ago. The rest is left as an exercise to the reader. But for prefix checking you basically need a method that starts at the root node (the variable words
), follows the letters of the prefix to successive child nodes, and returns True if such a path is found and False otherwise.
class Node(object):
def __init__(self, letter='', final=False):
self.letter = letter
self.final = final
self.children = {}
def __contains__(self, letter):
return letter in self.children
def get(self, letter):
return self.children[letter]
def add(self, letters, n=-1, index=0):
if n < 0: n = len(letters)
if index >= n: return
letter = letters[index]
if letter in self.children:
child = self.children[letter]
else:
child = Node(letter, index==n-1)
self.children[letter] = child
child.add(letters, n, index+1)
def load_dictionary(path):
result = Node()
for line in open(path, 'r'):
word = line.strip().lower()
result.add(word)
return result
words = load_dictionary('dictionary.txt')
A trie (or prefix tree) sounds right up your alley. It can do the search on a prefix string of length m in O(m) I believe.
string array.
then binary search through it to search the first match then step one by one through it for all subsequent matches
(i originally had linked list here too... but of course this doesn't have random access so this was 'bs' (which probably explains why I was downvoted). My binary search algorithm still is the fastest way to go though