Hello,

I need to code a solution for a certain requirement, and I wanted to know if anyone is familiar with an off-the-shelf library that can achieve it, or can point me to a best practice. Description:

The user inputs a word that is supposed to be one of several fixed options (I hold the options in a list). I know the input is supposed to match a member of the list, but since it is user input, it may contain a mistake. I'm looking for an algorithm that will tell me which word the user most probably meant. I don't have any context, and I can't force the user to choose from a list (i.e. they must be able to type the word in freely and manually).

For example, say the list contains the words "water", "quarter", "beer", "beet", "hell", "hello" and "aardvark".

The solution must account for different types of "normal" errors:

  • Speed typos (e.g. doubled characters, dropped characters, etc.)
  • Keyboard adjacent-character typos (e.g. "qater" for "water")
  • Non-native-speaker typos (e.g. "quater" for "quarter")
  • And so on...

The obvious solution is to compare the input letter by letter and assign "penalty weights" for each different letter, extra letter, and missing letter. But this ignores the thousands of "standard" mistakes that I'm sure are catalogued somewhere. I'm sure there are heuristics out there that deal with all the cases, both specific and general, probably using a large database of standard mismatches (I'm open to data-heavy solutions).
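
To make that concrete, here is the kind of baseline I have in mind, using Python's standard difflib as a stand-in for the hand-rolled penalty scoring (the word list is just my example from above):

    import difflib

    OPTIONS = ["water", "quarter", "beer", "beet", "hell", "hello", "aardvark"]

    def closest_option(user_input):
        """Return the most similar option, or None if nothing is close enough."""
        matches = difflib.get_close_matches(user_input.lower(), OPTIONS, n=1, cutoff=0.6)
        return matches[0] if matches else None

    print(closest_option("watre"))  # -> water
    print(closest_option("helo"))   # -> hello

This works for the easy cases, but as far as I can tell it has no notion of keyboard layout or phonetics, which is why I'm asking.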

I'm coding in Python, but I consider this question language-agnostic.

Any recommendations/thoughts?

+10  A: 

You want to read how Google does this: http://norvig.com/spell-correct.html
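
For a flavour of the approach, here is a condensed sketch of the article's candidate-generation idea, adapted to a fixed option list (Norvig's full version also ranks candidates by corpus frequency, which is omitted here):

    import string

    def edits1(word):
        """All strings one edit (delete, transpose, replace, insert) from word."""
        letters = string.ascii_lowercase
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [L + R[1:] for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
        inserts = [L + c + R for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def correct(word, options):
        """Prefer an exact hit, then one edit away, then two edits away."""
        options = set(options)
        if word in options:
            return word
        hits = edits1(word) & options
        if hits:
            return hits.pop()  # ties are broken arbitrarily here
        hits = {e2 for e1 in edits1(word) for e2 in edits1(e1)} & options
        return hits.pop() if hits else None

With the word list from the question, correct("qater", options) finds "water" (one replacement) and correct("quater", options) finds "quarter" (one insertion).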

Edit: Some people have mentioned algorithms that define a metric between a user-given word and a candidate word (Levenshtein, Soundex). This is, however, not a complete solution to the problem, since one would also need a data structure to efficiently perform a non-Euclidean nearest-neighbour search. This can be done e.g. with the Cover Tree: http://hunch.net/~jl/projects/cover_tree/cover_tree.html

bayer
+2  A: 

Have you considered algorithms that compare words by their phonetic sounds, such as Soundex? It shouldn't be too hard to produce Soundex representations of your list of words, store them, then compute the Soundex of the user input and find the closest match there.
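
To illustrate, here is a deliberately simplified Soundex (it treats 'h' and 'w' like vowels, which the full rules do not), plus the lookup described above:

    def soundex(word):
        """Simplified Soundex code: first letter plus up to three digits."""
        codes = {"b": "1", "f": "1", "p": "1", "v": "1",
                 "c": "2", "g": "2", "j": "2", "k": "2",
                 "q": "2", "s": "2", "x": "2", "z": "2",
                 "d": "3", "t": "3", "l": "4",
                 "m": "5", "n": "5", "r": "6"}
        word = word.lower()
        result = word[0].upper()
        prev = codes.get(word[0], "")
        for ch in word[1:]:
            code = codes.get(ch, "")
            if code and code != prev:
                result += code
            prev = code
        return (result + "000")[:4]

    def soundex_matches(user_input, options):
        """All options that sound like the input, e.g. 'watter' -> ['water']."""
        target = soundex(user_input)
        return [w for w in options if soundex(w) == target]

Note that Soundex was designed for surnames and can be too coarse on short words, so it works best combined with an edit-distance tiebreak.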

workmad3
+5  A: 

A common solution is to calculate the Levenshtein distance between the input and your fixed texts. The Levenshtein distance of two strings is the minimum number of simple operations - insertions, deletions, and substitutions of a single character - required to turn one of the strings into the other.
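
A minimal sketch of that, using the standard two-row dynamic-programming formulation and picking the nearest option:

    def levenshtein(a, b):
        """Edit distance with unit-cost insertions, deletions, substitutions."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,               # delete ca
                                curr[j - 1] + 1,           # insert cb
                                prev[j - 1] + (ca != cb))) # substitute
            prev = curr
        return prev[-1]

    def best_match(word, options):
        return min(options, key=lambda opt: levenshtein(word, opt))

For example, levenshtein("quater", "quarter") is 1, so best_match("quater", ...) picks "quarter" from the list in the question.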

Daniel Brückner
A: 

Though it may not solve the entire problem, you may want to consider using the Soundex algorithm as part of the solution. A quick Google search for "soundex" and "python" turns up several Python implementations of the algorithm.

B Pete
A: 

Try searching for "Levenshtein distance" or "edit distance". It counts the number of edit operations (delete, insert, change a letter) needed to transform one word into another. It's a common algorithm, but depending on the problem you might need something special, with different weights for the different types of typos.
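
As a sketch of the weighting idea: charge less for substituting keys that sit next to each other on the keyboard. The adjacency test below only looks at horizontal neighbours on a QWERTY layout, which is a simplification:

    QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

    def adjacent(a, b):
        """True if a and b are horizontal neighbours on a QWERTY row."""
        for row in QWERTY_ROWS:
            i, j = row.find(a), row.find(b)
            if i != -1 and j != -1 and abs(i - j) == 1:
                return True
        return False

    def weighted_levenshtein(a, b, adjacent_cost=0.5):
        """Levenshtein distance with cheaper adjacent-key substitutions."""
        prev = [float(j) for j in range(len(b) + 1)]
        for i, ca in enumerate(a, 1):
            curr = [float(i)]
            for j, cb in enumerate(b, 1):
                if ca == cb:
                    sub = 0.0
                elif adjacent(ca, cb):
                    sub = adjacent_cost
                else:
                    sub = 1.0
                curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + sub))
            prev = curr
        return prev[-1]

With this, "qater" is 0.5 away from "water" (q and w are neighbours) but 2.0 away from "quarter", so the keyboard typo resolves correctly.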

ziggystar
+1  A: 

Look at the Bitap algorithm. It is a good fit for what you want to do, and even comes with a source-code example on Wikipedia.
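
For reference, here is a Python translation of the fuzzy variant from the Wikipedia article. Note that this version counts substitutions only (no insertions or deletions), and uses Python's big integers in place of C's fixed-width bitmasks:

    def bitap_fuzzy(text, pattern, k):
        """Index of the first match of pattern in text with <= k substitutions, or -1."""
        m = len(pattern)
        if m == 0:
            return 0
        # Bit i of pattern_mask[c] is 0 iff pattern[i] == c.
        pattern_mask = {}
        for i, c in enumerate(pattern):
            pattern_mask[c] = pattern_mask.get(c, ~0) & ~(1 << i)
        R = [~1] * (k + 1)  # R[d] tracks matches allowing up to d substitutions
        for j, c in enumerate(text):
            old = R[0]
            R[0] |= pattern_mask.get(c, ~0)
            R[0] <<= 1
            for d in range(1, k + 1):
                tmp = R[d]
                R[d] = (old & (R[d] | pattern_mask.get(c, ~0))) << 1
                old = tmp
            if R[k] & (1 << m) == 0:
                return j - m + 1
        return -1

For whole-word matching you would run it once per option, e.g. bitap_fuzzy("qater", "water", 1) returns 0.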

Jem
+1  A: 

If your data set is really small, simply comparing the Levenshtein distance against all items independently ought to suffice. If it's larger, though, you'll need to use a BK-Tree or similar indexing system. The article I linked to describes how to find matches within a given Levenshtein distance, but it's fairly straightforward to adapt it to do nearest-neighbour searches (and that's left as an exercise for the reader ;).
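
A minimal BK-Tree sketch along those lines (the triangle inequality lets the search skip any subtree whose edge label falls outside [d - k, d + k]):

    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = curr
        return prev[-1]

    class BKTree:
        def __init__(self, words=()):
            self.root = None  # node = [word, {distance: child_node}]
            for w in words:
                self.add(w)

        def add(self, word):
            if self.root is None:
                self.root = [word, {}]
                return
            node = self.root
            while True:
                d = levenshtein(word, node[0])
                if d == 0:
                    return  # already stored
                if d not in node[1]:
                    node[1][d] = [word, {}]
                    return
                node = node[1][d]

        def search(self, word, k):
            """All (distance, word) pairs within distance k, nearest first."""
            found, stack = [], [self.root] if self.root else []
            while stack:
                node = stack.pop()
                d = levenshtein(word, node[0])
                if d <= k:
                    found.append((d, node[0]))
                for edge, child in node[1].items():
                    if d - k <= edge <= d + k:
                        stack.append(child)
            return sorted(found)

To get a nearest-neighbour search out of this, one simple approach is to call search() with k = 1, 2, 3, ... until it returns something.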

Nick Johnson