views: 1186
answers: 7

I need to implement a spell checker in C. Basically, I need all the standard operations... I need to be able to spell check a block of text, make word suggestions and dynamically add new words to the index.

I'd kind of like to write this myself, though I really don't know where to begin.

+3  A: 

Given that you don't know where to begin, I'd suggest using an existing solution. See, for example, aspell (LGPL licensed). If you really have to implement it yourself, please tell us why.

Paul
I'm curious. I may choose to use an existing package, but I'd like to understand this problem first. After googling a bit, I'm considering using a Levenshtein distance algorithm against a dictionary, but I'm not sure if that will be fast enough (speed and code size matter).
dicroce
Another open source spell checker is hunspell, which is used by OOo and Mozilla.
quinmars
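The Levenshtein distance mentioned in the comment above can be computed with a single-row dynamic-programming table. A minimal C sketch (the function name `levenshtein` and the fixed 64-character word limit are my own assumptions, not from any answer here):

```c
#include <string.h>

#define MAX_WORD 64  /* assumed upper bound on word length */

/* Classic DP Levenshtein distance, using one row of the table at a time. */
int levenshtein(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    int row[MAX_WORD + 1];

    for (size_t j = 0; j <= lb; j++)
        row[j] = (int)j;

    for (size_t i = 1; i <= la; i++) {
        int prev = row[0];            /* holds row[i-1][j-1] */
        row[0] = (int)i;
        for (size_t j = 1; j <= lb; j++) {
            int cur = row[j];         /* row[i-1][j], saved before overwrite */
            int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            int best = prev + cost;                        /* substitution */
            if (row[j] + 1 < best)     best = row[j] + 1;      /* deletion */
            if (row[j - 1] + 1 < best) best = row[j - 1] + 1;  /* insertion */
            row[j] = best;
            prev = cur;
        }
    }
    return row[lb];
}
```

For suggestions, you would rank dictionary words by this distance from the misspelled word and return the closest few.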
+17  A: 
e.James
This would be great and fast for checking for correct spelling, but I really need suggestions...
dicroce
I added a link to the Wikipedia article on Levenshtein distance. This is the best measurement algorithm I could find for ranking words according to how far away they are from a target word (the incorrectly spelled word, in this case).
e.James
The type of tree you are describing is a Patricia tree or radix tree; Wikipedia gives a good description at: http://en.wikipedia.org/wiki/Patricia_tree
Curt Sampson
@Curt Sampson: Thank you. I have added that link to my answer.
e.James
A ternary tree is described in 2. It has a pretty big space tradeoff.
HeretoLearn
@HeretoLearn: I don't think that is correct. A ternary tree has a maximum of three children per node, but the structure I described has up to 26 children per node.
e.James
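The structure debated in these comments, a node with up to 26 children (one per letter a-z), is a trie. A minimal C sketch, with my own names (`trie_new`, `trie_insert`, `trie_contains`) and restricted to lowercase ASCII:

```c
#include <ctype.h>
#include <stdlib.h>

/* One node per letter position; up to 26 children, one for each of a-z. */
struct trie_node {
    struct trie_node *child[26];
    int is_word;                 /* nonzero if a word ends at this node */
};

struct trie_node *trie_new(void)
{
    return calloc(1, sizeof(struct trie_node));  /* zeroed children */
}

void trie_insert(struct trie_node *root, const char *word)
{
    for (; *word; word++) {
        int i = tolower((unsigned char)*word) - 'a';
        if (i < 0 || i > 25)
            continue;            /* skip non-letters for this sketch */
        if (!root->child[i])
            root->child[i] = trie_new();
        root = root->child[i];
    }
    root->is_word = 1;
}

int trie_contains(const struct trie_node *root, const char *word)
{
    for (; *word; word++) {
        int i = tolower((unsigned char)*word) - 'a';
        if (i < 0 || i > 25 || !root->child[i])
            return 0;
        root = root->child[i];
    }
    return root->is_word;
}
```

Lookup cost is proportional to the word length, independent of dictionary size, which is what makes this attractive for the "spell check a block of text" part of the question.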
+10  A: 

There's a very interesting Python example and explanation here:

http://norvig.com/spell-correct.html

Martin Cote
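Norvig's Python example generates every candidate string at edit distance 1 from the misspelled word and keeps the ones that appear in the dictionary. As an illustration of one piece of that idea in C, here is just the single-character-deletion subset of candidates (the helper name `deletions` and the fixed buffer size are assumptions for this sketch):

```c
#include <string.h>

#define MAX_WORD 64  /* assumed upper bound on word length */

/* Writes all single-character deletions of `word` into `out`
 * (caller supplies at least strlen(word) rows) and returns the count.
 * e.g. "cat" -> "at", "ct", "ca". */
size_t deletions(const char *word, char out[][MAX_WORD])
{
    size_t n = strlen(word);
    for (size_t i = 0; i < n; i++) {
        memcpy(out[i], word, i);                 /* characters before i */
        memcpy(out[i] + i, word + i + 1, n - i); /* rest, incl. the NUL */
    }
    return n;
}
```

A full port would also generate transpositions, substitutions, and insertions, then test each candidate against the dictionary index.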
A: 

One should look at prefixes and suffixes.

suddenly = sudden + ly.

By removing the "ly", you can get away with storing just the root word.

Likewise, preallocate = pre + allocate.

And lovingly = love + ing + ly gets a bit more complex, as the English rules for "ing" get invoked.

There is also the possibility of using some sort of hashing function to map a root word to a specific bit in a large bit map, as a constant-time method of determining if the root word is spelled correctly.

You can get even more complex by trying to provide an alternate list of possible correct spellings to a misspelled word. You might research the soundex algorithm to get some ideas.

I would advise prototyping with a small set of words. Do a lot of testing, then scale up. It is a wonderful educational problem.

EvilTeach
I can't agree with this. Algorithmically stemming English words is notoriously inaccurate. It might have been worth the space and time saving back in the dark ages, but not any more.
Daniel Cassidy
"English doesn't borrow from other languages, it follows them into dark alleys, knocks them over, and goes through their pockets for loose grammar."
Ben Blank
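The bit-map idea from the answer above amounts to a one-hash Bloom filter. A minimal C sketch (FNV-1a is used here purely as an example hash, and the function names are mine): note that, like any Bloom filter, a set bit can be a false positive, so this can only say "probably correct", never "definitely misspelled" for a word it was given.

```c
#include <stdint.h>

#define BITMAP_BITS (1u << 20)           /* ~1M bits = 128 KB */

static uint8_t bitmap[BITMAP_BITS / 8];

/* FNV-1a hash, reduced to a bit index in the map. */
static uint32_t hash_word(const char *w)
{
    uint32_t h = 2166136261u;
    while (*w) {
        h ^= (uint8_t)*w++;
        h *= 16777619u;
    }
    return h % BITMAP_BITS;
}

void bitmap_add(const char *word)
{
    uint32_t h = hash_word(word);
    bitmap[h / 8] |= (uint8_t)(1u << (h % 8));
}

/* Constant-time membership test; may return 1 for a word never added. */
int bitmap_maybe_contains(const char *word)
{
    uint32_t h = hash_word(word);
    return (bitmap[h / 8] >> (h % 8)) & 1;
}
```

Using several independent hash functions instead of one lowers the false-positive rate for a given map size.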
A: 

Splitting a word into root and suffix is known as the "Porter Stemming Algorithm". It's a good way of fitting an English dictionary into an amazingly small amount of memory.
It's also useful for search, so "spell checker" will also find "spelling check" and "spell checking".

Martin Beckett
A: 

I've done this in class.

You should consider Python's Natural Language Toolkit (NLTK), which is made specifically to handle this.

It also lets you create text interpreters such as chatbots.

Eric
A: 

The OpenOffice spell checker Hunspell can be a good starting point. Here is the homepage: Hunspell at SourceForge

Thomas Maierhofer