views:

201

answers:

4

Assuming that a general Trie of dictionary words is built, what would be the best method to check for the 4 cases of spelling mistakes - substitution, deletion, transposition and insertion during traversal?

One method is to figure out all the words within n edit distances of a given word and then checking for them in the Trie. This isn't a bad option, but a better intuition here seems to be use a dynamic programming (or a recursive equivalent) method to determine the best sub-tries after having modified the words during traversal.

Any ideas would be welcome!

PS, would appreciate actual inputs rather than just links in answers.

+1  A: 

Take a look at calculating the Levenshtein distance which provides a dynamic programming solution for finding the distance between two sequences.

Taylor Leese
Thanks for the answer. I know about Levenshtein - the problem I'm dealing with is that I don't know which word to do a comparison with. Instead, what I want to do is *generate* a list of words within a given edit distance..
viksit
+1  A: 

I think you can do this with a straightforward breadth-first search on the tree: choose a threshold of the number of errors you are looking for, simply run through the letters of the word to be matched one at a time, generating a set of (prefix, subtrie) pairs reached so far matching the prefix, and while you are beneath your error threshold, add to your set of next subgoals:

  1. No error at this character place: add the subgoal of the trie at the next character in the word
  2. An inserted, deleted, or substituted character at this place: find the appropriate trie there, and increment the error count;
  3. Not an additional goal, but note that transpositions are either an insertion or deletion that matches an earlier deletion or insertion: if this test hold, then don't increment the error count.

This seems pretty naive: is there a problem with this that led you to think of dynamic programming?

Charles Stewart
+1  A: 

Presuming each successive character in your word represents one level in your tree, then you have five cases to check at each character (a match, deletion, insertion, substitution and transposition). I'm assuming transpositions are two adjacent characters.

You will need a function (CheckNode) that accepts a tree node and a character to check. It will need to return a set of (child/grand-child) nodes representing matches.

You will need a function (CheckWord) that accepts a word. It checks each character in turn against a set of nodes. It will return a set of (leaf) nodes representing matched words.

The idea is that each level in the tree (child, grand-child etc.), matches the position of the character in the word. If you call the top level tree node, level 0, then you'll have level 1, level 2 etc.

Clearly for a word without errors, there is a one to one match between the character position and the level in the tree.

For deletions, you need to skip a level in the tree.

For insertions, you need to skip a character in the word.

For substitutions, you need to skip both a level and a character.

For transpositions, you need to (temporarily) swap the characters in the word.

Jonathan Swift
Does this differ from my answer, apart from in restricting substitutions to be adjacent characters?
Charles Stewart
@Charles Stewart I agree it's still a breadth first search. But I'm not relying on an error count as Viksit only wants spelling suggestions (leaf nodes). I'm also providing a hint to the potential structure of a program that might solve the problem.
Jonathan Swift
+4  A: 

I actually wrote some code to do this the other day:

http://github.com/teoryn/SpellCheck/blob/master/spell_checker.py

It's based on the code by Peter Norvig (http://norvig.com/spell-correct.html) but stores the dictionary in a trie for finding words within a given edit distance faster.

The algorithm walks the trie recursively applying the possible edits (or not) at each step along the way by consuming letters from the input word. A parameter to the recursive call states how many more edits can be made. The trie helps narrow the search space by checking which letters can actually be reached from our given prefix. For example, when inserting a character, instead of adding each letter in the alphabet, we only add letters that are reachable from the current node. Not making an edit is equivalent to taking the branch from the current node in the trie along the current letter from the input word. If that branch is not there then we can backtrack and avoid searching a possibly large space where no real words could be found.

Kevin Stock
+1 code. I'm surprised there aren't more tricky cases in the code.
Charles Stewart