views:

941

answers:

5

Hi, I'm using levenshtein algorithm to meet these requirements:

When finding a word of N characters, the words to suggest as correction in my dictionary database are:

Every dictionary word of N characters that has 1 character of difference with the found word. Example: found word:bearn, dictionary word: bears

Every dictionary word of N+1 characters that has N characters equal to the found word. Example: found word:bear, dictionary word: bears

Every dictionary word of N-1 characters that has N-1 characters equal to the found word. Example: found word: bears, dictionary word: bear

I'm using this implementation of Levenshtein algorithm in C++ to find when a word has a Levenshtein number of 1 (which is the Levenshtein number for all three cases), but then how do I choose the word to suggest? I read about Boyer-Moore-Horspool and Knuth-Morris-Pratt but I'm not sure on how either of them can be helpful.

#include <string>
#include <vector>
#include <algorithm>

using namespace std;

int levenshtein(const string &s1, const string &s2)
{
   string::size_type N1 = s1.length();
   string::size_type N2 = s2.length();
   string::size_type i, j;
   vector<int> T(N2+1);

   for ( i = 0; i <= N2; i++ )
      T[i] = i;

   for ( i = 0; i < N1; i++ ) {
      T[0] = i+1;
      int corner = i;
      for ( j = 0; j < N2; j++ ) {
         int upper = T[j+1];
         if ( s1[i] == s2[j] )
            T[j+1] = corner;
         else
            T[j+1] = min(T[j], min(upper, corner)) + 1;
         corner = upper;
      }
   }
   return T[N2];
}
+3  A: 

You may also want to add Norvig's excellent article on spelling correction to your reading.

It's been a while since I've read it but I remember it being very similar to what your writing about.

Triptych
Thanks, but I don't know python..
Melkhiah66
the article might be in, but is not about python
Bedwyr Humphreys
Even without knowing Python code, you'll probably be able to follow the article. The only non-obvious stuff I see in that code is 'list comprehensions', which you can google for.
Triptych
A: 

You'll need to find a dictionary (a text file containing a list of words), and see if any of the strings you've created exist in the dictionary.

If they do, then add that to your list of words to suggest.

David
all the words he is comparing with the misspelling are in the dictionary
nlucaroni
Nowhere in the sample code do I see a dictionary being loaded in and stored into an array.
David
A: 

Why restrict the suggestion to a single word, why not include a set of words? If you are restricted to a single word, you can order your results by some pre-calculated frequency of usage or something. This frequency could be updated based on what users select from the suggestion.

Also, in the case where there isn't a spelling error in the original word, you might want to prioritize the N+1 cases, which would be more like an autocomplete. Anyway I don't think there is one correct way to do it, maybe if your requirements are more specific, it would be easier to narrow down.

Also, you don't need to know Python to understand the algorithms described in Norvig's article.

codelogic
+2  A: 

As I've said elsewhere, Boyer-Moore isn't really apt for this. Since you want to search for multiple stings simultanously, the algorithm of Wu and Manber should be more to your liking.

I've posted a proof of concept C++ code in answer to another question. Heed the caveats mentioned there.

Konrad Rudolph
A: 

If I understand you correctly, then there is no correct answer to your question. You are identifying up to three suggestions for a given word using Levenshtein - it is up to you to come up with a rule to decide which one to use and which ones to filter out. Or perhaps you should use them all?

Just as a matter of interest, the Damerau extension to Levenshtein might be of interest to you, where two swapped characters are also considered to give a score of 1, instead of 2, which is what vanilla Levenshtein returns.