views: 508

answers: 5

We have a list of about 150,000 words, and when the user enters free text, the system should present a list of words from the dictionary that are very close to words in that text.

For instance, the user enters: "I would like to buy legoe toys in Walmart". If the dictionary contains "Lego", "Car" and "Walmart", the system should present "Lego" and "Walmart" in the list. "Walmart" is obvious because it is identical to a word in the sentence, but "Lego" is similar enough to "Legoe" to be mentioned, too. However, nothing is similar to "Car", so that word is not shown.

Showing the list should happen in real time: as soon as the user has entered the sentence, the list of words must be on the screen. Does anybody know a good algorithm for this?

The dictionary actually contains concepts which may include a space. For instance, "Lego spaceship". The perfect solution recognizes these multi-word concepts, too.

Any suggestions are appreciated.

+1  A: 

It might be of interest to look at some algorithms, such as the Levenshtein distance, which calculates the amount of difference between two strings.

I'm not sure what language you are thinking of using but PHP has a function called levenshtein that performs this calculation and returns the distance. There's also a function called similar_text that does a similar thing. There's a code example here for the levenshtein function that checks a word against a dictionary of possible words and returns the closest words.
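For readers not using PHP, the same dynamic-programming computation is easy to write directly. Here is a minimal sketch in Python (the function name simply mirrors PHP's `levenshtein`; this is the textbook two-row formulation, not any particular library's implementation):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: minimum number of insertions,
    deletions and substitutions to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))  # distances from the empty prefix
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                 # deletion
                curr[j - 1] + 1,             # insertion
                prev[j - 1] + (ca != cb),    # substitution (free on match)
            ))
        prev = curr
    return prev[-1]
```

For example, `levenshtein("lego", "legoe")` is 1, which is the kind of near-match the question wants to catch.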

I hope this gives you a bit of insight into how a solution could work!

Michael Waterfall
Calculating the Levenshtein distance of two words is very expensive. Running the PHP method against every single word in his data set would likely take a very long time.
Ben S
Yup, Levenshtein is not suitable for string-to-dictionary comparisons; it's a string-to-string measure.
MSalters
Very true. To be honest I don't know much about it; I just remembered something about Levenshtein distances! With such a large dictionary, something like Ben S suggested (indexing the dictionary and implementing some sort of fuzzy string matching) would be the best method.
Michael Waterfall
A: 

What you want is a spell checker. Here's one in 22 lines of C# code: http://www.codegrunt.co.uk/code/spell.cs

danbystrom
+2  A: 

You would likely want to use an algorithm which calculates the Levenshtein distance.

However, since your data set is quite large, and you'll be comparing lots of words against it, a direct implementation of typical algorithms that do this won't be practical.

In order to find words in a reasonable amount of time, you will have to index your set of words in some way that facilitates fuzzy string matching.

One of these indexing methods would be to use a suffix tree. Another approach would be to use n-grams.

I would lean towards using a suffix tree since I find it easier to wrap my head around it and I find it more suited to the problem.
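To illustrate the n-gram alternative, here is a rough Python sketch of a trigram index (the names and the `min_shared` threshold are made up for illustration): words sharing few trigrams with the query are skipped before any expensive distance computation is run.

```python
from collections import defaultdict

def trigrams(word):
    """Set of overlapping 3-character windows; padding with spaces
    lets short words and word boundaries still produce trigrams."""
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def build_index(dictionary):
    """Map each trigram to the dictionary words containing it."""
    index = defaultdict(set)
    for word in dictionary:
        for gram in trigrams(word):
            index[gram].add(word)
    return index

def candidates(index, query, min_shared=2):
    """Dictionary words sharing at least min_shared trigrams with the query."""
    counts = defaultdict(int)
    for gram in trigrams(query):
        for word in index[gram]:
            counts[word] += 1
    return [w for w, c in counts.items() if c >= min_shared]
```

Only the surviving candidates would then be handed to a full edit-distance comparison.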

Ben S
+4  A: 

Take a look at http://norvig.com/spell-correct.html for a simple algorithm. The article uses Python, but there are links to implementations in other languages at the end.
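The heart of that article is generating every string one edit away from the input and intersecting the result with the dictionary. A condensed sketch of that idea (Norvig's version also ranks candidates by corpus frequency, which is omitted here):

```python
import string

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word, dictionary):
    """Return the word itself, a dictionary word one edit away, or None."""
    if word in dictionary:
        return word
    matches = edits1(word) & dictionary
    return min(matches) if matches else None  # real code would rank by frequency
```

With the question's example dictionary, `correct("legoe", {"lego", "car", "walmart"})` finds "lego".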

Phil Ross
+1, Norvig is always a good recommendation there :)
Matthieu M.
+2  A: 

You will be doing quite a few lookups of words against a fixed dictionary. Therefore you need to prepare your dictionary. Logically, you can quickly eliminate candidates that are "just too different".

For instance, the words car and dissimilar may share a suffix, but they're obviously not misspellings of each other. Now why is that so obvious to us humans? For starters, the lengths are entirely different. That's an immediate disqualification (with one exception, discussed below). So your dictionary should be sorted by word length, and you should match your input word against words of similar length. For short words that means +/- 1 character; longer words should get a higher margin (exactly how well can your demographic spell?).
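The length filter can be sketched as follows (a toy illustration in Python; in practice `margin` would grow with the word length, as suggested above):

```python
from collections import defaultdict

def bucket_by_length(dictionary):
    """Preprocessing: group dictionary words by their length."""
    buckets = defaultdict(list)
    for word in dictionary:
        buckets[len(word)].append(word)
    return buckets

def length_candidates(buckets, word, margin=1):
    """Only words within +/- margin characters of the input's length."""
    out = []
    for n in range(len(word) - margin, len(word) + margin + 1):
        out.extend(buckets.get(n, []))
    return out
```

For the input "legoe" (length 5) and a margin of 1, only dictionary words of length 4 to 6 survive; "walmart" is never even looked at.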

Once you've restricted yourself to candidate words of similar length, you'll want to strip out words that are entirely dissimilar, i.e. words that use entirely different letters. This is easiest to compare if you sort the letters in a word alphabetically. E.g. car becomes "acr"; rack becomes "ackr". You do this in preprocessing for your dictionary, and once for each input word. The reason is that it's cheap to determine the (size of the) difference of two sorted sets. (Add a comment if you need an explanation.) car and rack have a difference of size 1; car and hat have a difference of size 4. This narrows down your set of candidates even further. Note that for longer words, you can bail out early once you've found too many differences. E.g. dissimilar and biography have a total difference of 13, but considering the lengths (10/9) you can probably bail out once you've found 5 differences.
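The sorted-letter comparison amounts to taking the symmetric difference of two letter multisets, which Python's `collections.Counter` makes cheap (a sketch; in the precomputed scheme above you would store the sorted letters alongside each dictionary word, and add the early bail-out for long words):

```python
from collections import Counter

def letter_difference(a, b):
    """Size of the symmetric difference of the two words' letter multisets,
    counting repeated letters separately."""
    ca, cb = Counter(a), Counter(b)
    # Counter subtraction keeps only positive counts, so the two
    # one-sided differences together form the symmetric difference.
    return sum(((ca - cb) + (cb - ca)).values())
```

This reproduces the numbers in the text: car vs. rack differ by 1 letter, car vs. hat by 4, and dissimilar vs. biography by 13.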

This leaves you with a set of candidate words that use almost the same letters and are almost the same length. At this point you can start using more refined algorithms; you no longer need to run 150,000 comparisons per input word.

Now, for the length exception mentioned before: the problem is in "words" like greencar. It doesn't really match a word of length 8, and yet to humans it's quite obvious what was meant. You can't afford to break the input word at every possible boundary and run N-1 inexact matches against both halves. However, it is feasible to check for just a missing space. Just do a lookup for all possible prefixes. This is efficient because you'll be using the same part of the dictionary over and over, e.g. g, gr, gre, gree, etc. For every prefix that you find, check whether the remaining suffix is also in the dictionary, e.g. reencar, eencar. If both halves of the input word are in the dictionary, but the word itself isn't, you can assume a missing space.
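The prefix walk for a missing space can be sketched like this (using a plain set for the dictionary; with a trie or a sorted list the repeated prefix lookups would share work, as described above):

```python
def find_missing_space(word, dictionary):
    """If word is two dictionary words run together, return the split
    as a (prefix, suffix) pair; otherwise return None."""
    for i in range(1, len(word)):
        prefix, suffix = word[:i], word[i:]
        if prefix in dictionary and suffix in dictionary:
            return prefix, suffix
    return None
```

So for "greencar" against a dictionary containing "green" and "car", the function recovers the intended two words.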

MSalters