views:

241

answers:

3

I am working on implementing a spell check function for a web-based WYSIWYG editor. I am currently using the Damerau-Levenshtein distance algorithm to produce a list of spelling suggestions. This is all working out nicely, but I am curious as to how I might improve the functionality.

Specifically, my implementation does not currently handle conjoined words. For instance, I would like to be able to detect "areyou" and suggest "are you" instead. I think I can do this by breaking the potentially conjoined word apart at likely looking segments and testing both halves. Since all English words must have at least one vowel, I think I can look for vowels to help me decide where to break words apart.

The Damerau-Levenshtein distance algorithm was so useful; it is clear that others have put a lot more thought into this than I have. Is there a similarly clever algorithm that I should consider for detecting conjoined words, or am I on the right track already?

+3  A: 
Heath Hunnicutt
+1  A: 

As you are already reading the whole dictionary for every word, it wouldn't be terribly inefficient to append common pairs of words to the dictionary. Alternatively, you can divide the input (possibly conjoined word) into two words in all possible ways and then look for words near each of them in the dictionary. It isn't as slow as it sounds -- you can use DL intermediate results of a word to get the results for its prefix.

Robert Obryk
+1  A: 

Check out this excellent article on writing a spelling checker. Using that technique, you have two options: Either include every pair of words, or every likely pair of words in the dictionary (with the separated words as the solution), or try every possible split point and do a standard dictionary lookup to see if both words are valid.

Nick Johnson