views:

494

answers:

2

What algorithm is typically used when implementing a spell checker that is accompanied with word suggestions?

At first I thought it might make sense to check each new word typed (if not found in the dictionary) against it's Levenshtein distance from every other word in the dictionary and returning the top results. However, this seems like it would be highly inefficient, having to evaluate the entire dictionary repeatedly.

How is this typically done?

+18  A: 

There is good essay by Peter Norvig how to implement a spelling corrector. It's basicly a brute force approach trying candidate strings with a given edit distance. (Here are some tips how you can improve the spelling corrector performance using a Bloom Filter and faster candidate hashing.)

The requirements for a spell checker are weaker. You have only to find out that a word is not in the dictionary. You can use a Bloom Filter to build a spell checker which consumes less memory. An ancient versions is decribed in Programming Pearls by Jon Bentley using 64kb for an english dictionary.

A BK-Tree is an alternative approach. A nice article is here.

Levenshstein distance is not exactly the right edit distance for a spell checker. It knows only insertion, deletion and substitution. Transposition is missing and produces 2 for a transposition of 1 character (it's 1 delete and 1 insertion). Damerau–Levenshtein distance is the right edit distance.

Thomas Jung
Interesting article. +1
Simon P Stevens
+1: Excellent answer and interesting references.
Max Shawabkeh
+1 Excellent references, I knew of `Norvig` but I was quite astonished by the `Bloom Filter` article.
Matthieu M.
+1 for the relatively unknown BK-Tree reference. That's how companies like Google, working with Real-World [TM] amount of data, are doing it.
NoozNooz42
Fantastic +1, bloom filter is great...
Dori
A: 

You don't need to know exact edit distance for each word in dictionary. You can stop the algorithm after reaching a limit value and exclude the word. This will save you a lot of computing time.

Petr Peller