tags:

views:

1502

answers:

5

I'm interested in implementing autocomplete in Python. For example, as the user types in a string, I'd like to show the subset of files on disk whose names start with that string.

What's an efficient algorithm for finding strings that match some condition in a large corpus (say a few hundred thousand strings)? Something like:

matches = [s for s in allfiles if s.startswith(input)]

I'd like to have the condition be flexible; eg. instead of a strict startswith, it'd be a match so long as all letters in input appears in s in the same order. What's better than the brute-force method I'm showing here?

+3  A: 

I used Lucene to autocomplete a text field with more then a hundred thousand possibilities and I perceived it as instantaneous.

Toni Ruža
A: 

The flexibility you want for matching your string is called Fuzzy Matching or Fuzzy Searching . I am not aware of any python implementation (but I haven't looked deeply in the subject) but there are C/C++ implementations that you can reuse, like the TRE packaged that supports regexp with fuzzy parameters.

Apart from that, there is always the question of whether the total list of your words fits in memory or not. If not, keeping them in a list is not feasible and will have to cache something to disk or to a database.

Bluebird75
+1  A: 

Maybe the readline module is what you are looking for. It is an interface to the GNU readline library Python Documentation. Maybe you can supply your own completition function with set_completer().

unbeknown
A: 

(addressing just the string matching part of the question)

If you want to try something quickly yourself, why not create a few dictionaries, each mapping initial patterns to lists of strings where

  • Each dictionary is keyed on initial patterns of a particular length
  • All the strings in a string list start with the initial pattern
  • An initial pattern/string list pair is only created if there are less than a certain number (say 10) of strings in the list

So, when the user has typed three characters, for example, you look in the dictionary with keys of length 3. If there is a match, it means you have between 1 and 10 possibilities immediately available.

James Hopkin
+6  A: 

For exact matching, generally the way to implement something like this is to store your corpus in a trie. The idea is that you store each letter as a node in the tree, linking to the next letter in a word. Finding the matches is simply walking the tree, and showing all children of your current location. eg. "cat", "cow" and "car" would be stored as:

  a--t
 / \ 
c   r
 \
  o--w

When you get a c, you start at the c node, an a will then take you to the c/a node (children "t" and "r", making cat and car as your completions).

Note that you'll also need to mark nodes that are complete words to handle names that are substrings of others (eg "car" and "cart")

To get the desired fuzzy matching, you may need to make some changes however.

Brian