views:

317

answers:

5

I want to write a very simple Spell Checker. The spell checker will try to match the input word with equivalent words form the dictionary.

What can be done to find those 'equivalent words'? What analysis can be preformed on two words to mark them equivalent?

+5  A: 

Before investing too much trying to unravel that i'd first look to already existing implementations like Aspell or netspell for two main reasons

  1. Not much point in re-inventing the wheel. Spell checking is much trickier than it first appears and it makes sense to build on work that has already been done
  2. If your interest is finding out how to do it, the source code and community will be a great benefit should you decide to implement your own anyway
Conrad
That said, it's quite an interesting problem to work on. Reinventing the wheel can be educational too.
Richard Nichols
+1  A: 

Edit Distance is the theory you need to write a spell checker. You also need a dictionary. Most UNIX systems come with a dictionary already installed for your locale.

jldugger
A: 

Under linux/unix you have ispell. Why reinventing the whell.

Luixv
It's ironic that in an answer about spell checking you've spelt "wheel" incorrectly...!
Sean
They also missed a hyphen, but I doubt that was meant quite as ironically. :)
Joe Swan
Sorry. It was not ironically (unfortunately). In fact it was a typo. You a get 1 point from me :-)
Luixv
A: 

Much depends on your use case. For example:

  • Is your dictionary very small (about twenty words)? In this case it probably is better to precompute all possible nearby mistaken words and use a table/hash lookup.
  • What is your error model? Aspell has at least two (one for spelling errors caused by nearby letters on the keyboard, and the other for spelling errors caused by the way a word sounds).
  • How dynamic is your dictionary? Can you afford to do a massive preparation in order to get an efficient retrieval?
  • You may need a "word equivalence" measure like Double Metaphone, in addition to edit distance.
  • You can get some feel by reading Peter Norvig's great description of spelling correction.
  • And, of course, whenever possible, steal code. Do not reinvent the wheel without a reason - a reason could be a very special domain, a special way your users make spelling mistakes, or just to learn how it's done.
Yuval F
Thanks. Your suggestions have been very helpful.
sul4bh
Glad they did. Good luck.
Yuval F
A: 

I just finished implementing a spell checker and used a combination of the following in getting a list of "suggested" words

  • Phonetic hashing of the "misspelled" word to lookup a hash of identical dictionary hashed real words (for java check out Apache Commons Codec for a suitable library). The phonetic hash of your dictionary file can be precomputed.
  • Edit distance between the input and the potentials (this is reasonably expensive so you need to reduce the list first with something like a phonetic hash, assuming a higher volume load - in my case, a server based spell check)
  • A known list of common misspellings, e.g. recieve vs. receive.
  • An ordered list of the most common words in the english language

Essentially I weighted each potential word primarily based on edit-distance and commonality. e.g. if word probability is a percentage, then

weight = edit-distance *  100 / probability

(lower weights are better)

But then I also also override any result with the known common misspellings (i.e. these always float to the top suggested result).

There may be better ways, but this worked pretty well.

You may also wish to ignore ALL CAPS words, initials etc, so choosing what to ignore is also something to think about.

Richard Nichols