I'm writing an application that needs to read a list of strings from a file, save them in a data structure, and then look up those strings by prefix. The list of strings is simply a list of words in a given language. For example, if the search function gets "stup" as a parameter, it should return ["stupid", "stupidity", "stupor"...]. It should do so in O(log(n)*m) time, where n is the size of the data structure and m is the number of results, and it should be as fast as possible. Memory consumption is not a big issue right now. I'm writing this in Python, so it would be great if you could point me to a suitable data structure, preferably one implemented in C with Python wrappers.

+12  A: 

You want a trie.

http://en.wikipedia.org/wiki/Trie

I've used them in Scrabble and Boggle programs. They're perfect for the use case you described (fast prefix lookup).

Here's some sample code for building up a trie in Python. This is from a Boggle program I whipped together a few months ago. The rest is left as an exercise for the reader, but for prefix checking you basically need a method that starts at the root node (the variable words), follows the letters of the prefix to successive child nodes, and returns True if such a path is found and False otherwise; a rough sketch of that is included after the code.

class Node(object):
    def __init__(self, letter='', final=False):
        self.letter = letter
        self.final = final          # True if this node ends a complete word
        self.children = {}          # maps letter -> child Node
    def __contains__(self, letter):
        return letter in self.children
    def get(self, letter):
        return self.children[letter]
    def add(self, letters, n=-1, index=0):
        if n < 0: n = len(letters)
        if index >= n: return
        letter = letters[index]
        if letter in self.children:
            child = self.children[letter]
            # mark an existing node as a word ending, e.g. adding "stu" after "stupid"
            if index == n-1: child.final = True
        else:
            child = Node(letter, index==n-1)
            self.children[letter] = child
        child.add(letters, n, index+1)

def load_dictionary(path):
    result = Node()
    with open(path, 'r') as f:      # 'with' makes sure the file gets closed
        for line in f:
            word = line.strip().lower()
            result.add(word)
    return result

words = load_dictionary('dictionary.txt')
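
For reference, a prefix lookup along those lines might look something like this (an untested sketch, not from the original program; the find_prefix, iter_words, and words_with_prefix names are just illustrative):

def find_prefix(root, prefix):
    # Follow the letters of the prefix down from the root;
    # return the node it ends at, or None if the path breaks.
    node = root
    for letter in prefix:
        if letter not in node:      # uses Node.__contains__
            return None
        node = node.get(letter)
    return node

def iter_words(node, prefix):
    # Yield every complete word stored at or below this node.
    if node.final:
        yield prefix
    for letter, child in node.children.items():
        for word in iter_words(child, prefix + letter):
            yield word

def words_with_prefix(root, prefix):
    node = find_prefix(root, prefix)
    return list(iter_words(node, prefix)) if node else []

# e.g. words_with_prefix(words, 'stup') -> ['stupid', 'stupidity', 'stupor', ...]

Note that this only works if each word's terminal node has its final flag set correctly (which is why add() above also marks existing nodes as final); otherwise words that are prefixes of other words would be missed.
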
FogleBird
I'd be tempted to write Node.add() as an iterative method:

def add(self, letters):
    next = self
    n = len(letters)
    for index, letter in enumerate(letters):
        if not (letter in next.children):
            next.children[letter] = Node(letter, index==n-1)
        next = next.children[letter]

hughdbrown
+2  A: 

A trie (or prefix tree) sounds right up your alley. I believe it can do the search for a prefix string of length m in O(m).

Falaina
A: 

A string array.

Then binary search through it to find the first match, then step through it one by one for all subsequent matches.

(I originally had a linked list here too... but of course that doesn't have random access, so it was 'bs', which probably explains why I was downvoted.) My binary search approach is still the fastest way to go, though.
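
For what it's worth, a rough sketch of that idea using Python's bisect module on a pre-sorted list might look like this (untested; the function name is just illustrative):

import bisect

def prefix_matches(sorted_words, prefix):
    # Binary search for the first word >= prefix, then walk forward
    # collecting every word that actually starts with the prefix.
    results = []
    i = bisect.bisect_left(sorted_words, prefix)
    while i < len(sorted_words) and sorted_words[i].startswith(prefix):
        results.append(sorted_words[i])
        i += 1
    return results

# words = sorted(line.strip().lower() for line in open('dictionary.txt'))
# prefix_matches(words, 'stup') -> ['stupid', 'stupidity', 'stupor', ...]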

Toad
My god, give the guy a break.
orlandu63
So many downvotes, and no one cares to explain why? It feels like people are just piling on. This answer satisfies all the constraints the original user requested, and is simpler to understand and maintain.
Michael Donohue
@reinier - I imagine you got downvoted because linked lists are not good for random access, so your search time is linear in the number of elements. An array would be quick(ish) if it was already sorted, but it'd be O(log n), whereas a trie is O(m), where m is the length of the prefix.
Dominic Rodger
Yes, true. It needs random access (or else the binary search would not work) ;^) So yes, an array is the only sane choice.
Toad
A trie is better because lookups are O(m), where m is the length of the *key* (the length of a word, so probably less than 10 characters), whereas a binary search will be O(m*log(n)), where n is the size of the *whole dictionary* (and note that m is still present). So a trie scales very well even with very long word lists.
FogleBird
Reading the wiki article right now... nice solution.
Toad