I'm writing an application that needs to read a list of strings from a file, save them in a data structure, and then look up those strings by prefix. The list of strings is simply a list of words in a given language. For example, if the search function gets "stup" as a parameter, it should return ["stupid", "stupidity", "stupor"...]. It should do so in O(log(n)*m) time, where n is the size of the data structure and m is the number of results, and it should be as fast as possible. Memory consumption is not a big issue right now. I'm writing this in Python, so it would be great if you could point me to a suitable data structure, preferably one implemented in C with Python wrappers.

+12  A: 

You want a trie.

http://en.wikipedia.org/wiki/Trie

I've used them in Scrabble and Boggle programs. They're perfect for the use case you described (fast prefix lookup).

Here's some sample code for building up a trie in Python. This is from a Boggle program I whipped together a few months ago. The rest is left as an exercise for the reader, but for prefix checking you basically need a method that starts at the root node (the variable words), follows the letters of the prefix to successive child nodes, and returns True if such a path is found and False otherwise; a rough sketch of that is included after the code.

class Node(object):
    def __init__(self, letter='', final=False):
        self.letter = letter
        self.final = final          # True if this node ends a complete word
        self.children = {}          # maps letter -> child Node
    def __contains__(self, letter):
        return letter in self.children
    def get(self, letter):
        return self.children[letter]
    def add(self, letters, n=-1, index=0):
        if n < 0: n = len(letters)
        if index >= n: return
        letter = letters[index]
        if letter in self.children:
            child = self.children[letter]
            # mark an existing node as a word ending, e.g. adding "stu" after "stupid"
            if index == n-1: child.final = True
        else:
            child = Node(letter, index==n-1)
            self.children[letter] = child
        child.add(letters, n, index+1)

def load_dictionary(path):
    result = Node()
    with open(path, 'r') as f:      # 'with' makes sure the file gets closed
        for line in f:
            word = line.strip().lower()
            result.add(word)
    return result

words = load_dictionary('dictionary.txt')
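
For reference, a prefix lookup along those lines might look something like this (an untested sketch, not from the original program; the find_prefix, iter_words, and words_with_prefix names are just illustrative):

def find_prefix(root, prefix):
    # Follow the letters of the prefix down from the root;
    # return the node it ends at, or None if the path breaks.
    node = root
    for letter in prefix:
        if letter not in node:      # uses Node.__contains__
            return None
        node = node.get(letter)
    return node

def iter_words(node, prefix):
    # Yield every complete word stored at or below this node.
    if node.final:
        yield prefix
    for letter, child in node.children.items():
        for word in iter_words(child, prefix + letter):
            yield word

def words_with_prefix(root, prefix):
    node = find_prefix(root, prefix)
    return list(iter_words(node, prefix)) if node else []

# e.g. words_with_prefix(words, 'stup') -> ['stupid', 'stupidity', 'stupor', ...]

Note that this only works if each word's terminal node has its final flag set correctly (which is why add() above also marks existing nodes as final); otherwise words that are prefixes of other words would be missed.
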
FogleBird
I'd be tempted to write Node.add() as an iterative method:

def add(self, letters):
    next = self
    n = len(letters)
    for index, letter in enumerate(letters):
        if not (letter in next.children):
            next.children[letter] = Node(letter, index==n-1)
        next = next.children[letter]

hughdbrown
+2  A: 

A trie (or prefix tree) sounds right up your alley. I believe it can do the search for a prefix string of length m in O(m).

Falaina
A: 

A string array.

Then binary search through it to find the first match, then step through it one by one for all subsequent matches.

(I originally had a linked list here too... but of course that doesn't have random access, so it was 'bs', which probably explains why I was downvoted.) My binary search approach is still the fastest way to go, though.
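
For what it's worth, a rough sketch of that idea using Python's bisect module on a pre-sorted list might look like this (untested; the function name is just illustrative):

import bisect

def prefix_matches(sorted_words, prefix):
    # Binary search for the first word >= prefix, then walk forward
    # collecting every word that actually starts with the prefix.
    results = []
    i = bisect.bisect_left(sorted_words, prefix)
    while i < len(sorted_words) and sorted_words[i].startswith(prefix):
        results.append(sorted_words[i])
        i += 1
    return results

# words = sorted(line.strip().lower() for line in open('dictionary.txt'))
# prefix_matches(words, 'stup') -> ['stupid', 'stupidity', 'stupor', ...]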

Toad
My god, give the guy a break.
orlandu63
So many downvotes, and no one cares to explain why? It feels like people are just piling on. This answer satisfies all the constraints the original user requested, and is simpler to understand and maintain.
Michael Donohue
@reinier - I imagine you got downvoted because linked lists are not good for random access, so your search time is linear in the number of elements. An array would be quick(ish) if it was already sorted, but it'd be O(log n), whereas a trie is O(m), where m is the length of the prefix.
Dominic Rodger
Yes, true. It needs random access (or else the binary search would not work) ;^) So yes, an array is the only sane choice.
Toad
A trie is better because lookups are O(m), where m is the length of the *key* (the length of a word, so probably less than 10 characters), whereas a binary search will be O(m*log(n)), where n is the size of the *whole dictionary* (and note that m is still present). So a trie scales very well even with very long word lists.
FogleBird
Reading the wiki article right now... nice solution.
Toad