views: 1186
answers: 7

I need to implement a spell checker in C. Basically, I need all the standard operations... I need to be able to spell check a block of text, make word suggestions and dynamically add new words to the index.

I'd kind of like to write this myself, though I really don't know where to begin.

+3  A: 

Given that you don't know where to begin, I'd suggest using an existing solution. See, for example, aspell (LGPL licensed). If you really have to implement it yourself, please tell us why.

Paul
I'm curious. I may choose to use an existing package, but I'd like to understand this problem first. After googling a bit, I'm considering using a Levenshtein distance algorithm against a dictionary, but I'm not sure if that will be fast enough (speed and code size matter).
dicroce
Another open source spell checker is hunspell, which is used by OOo and Mozilla.
quinmars
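The Levenshtein distance mentioned in the comment above can be computed with a single-row dynamic-programming table. A minimal C sketch (the function name `levenshtein` and the fixed 64-character word limit are my own assumptions, not from any answer here):

```c
#include <string.h>

#define MAX_WORD 64  /* assumed upper bound on word length */

/* Classic DP Levenshtein distance, using one row of the table at a time. */
int levenshtein(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    int row[MAX_WORD + 1];

    for (size_t j = 0; j <= lb; j++)
        row[j] = (int)j;

    for (size_t i = 1; i <= la; i++) {
        int prev = row[0];            /* holds row[i-1][j-1] */
        row[0] = (int)i;
        for (size_t j = 1; j <= lb; j++) {
            int cur = row[j];         /* row[i-1][j], saved before overwrite */
            int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            int best = prev + cost;                        /* substitution */
            if (row[j] + 1 < best)     best = row[j] + 1;      /* deletion */
            if (row[j - 1] + 1 < best) best = row[j - 1] + 1;  /* insertion */
            row[j] = best;
            prev = cur;
        }
    }
    return row[lb];
}
```

For suggestions, you would rank dictionary words by this distance from the misspelled word and return the closest few.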
+17  A: 
e.James
This would be great and fast for checking for correct spelling, but I really need suggestions...
dicroce
I added a link to the Wikipedia article on Levenshtein distance. This is the best measurement algorithm I could find for ranking words according to how far away they are from a target word (the incorrectly spelled word, in this case).
e.James
The type of tree you are describing is a Patricia tree or radix tree; Wikipedia gives a good description at: http://en.wikipedia.org/wiki/Patricia_tree
Curt Sampson
@Curt Sampson: Thank you. I have added that link to my answer.
e.James
A ternary tree is described in 2. It has a pretty big space tradeoff.
HeretoLearn
@HeretoLearn: I don't think that is correct. A ternary tree has a maximum of three children per node, but the structure I described has up to 26 children per node.
e.James
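The structure debated in these comments, a node with up to 26 children (one per letter a-z), is a trie. A minimal C sketch, with my own names (`trie_new`, `trie_insert`, `trie_contains`) and restricted to lowercase ASCII:

```c
#include <ctype.h>
#include <stdlib.h>

/* One node per letter position; up to 26 children, one for each of a-z. */
struct trie_node {
    struct trie_node *child[26];
    int is_word;                 /* nonzero if a word ends at this node */
};

struct trie_node *trie_new(void)
{
    return calloc(1, sizeof(struct trie_node));  /* zeroed children */
}

void trie_insert(struct trie_node *root, const char *word)
{
    for (; *word; word++) {
        int i = tolower((unsigned char)*word) - 'a';
        if (i < 0 || i > 25)
            continue;            /* skip non-letters for this sketch */
        if (!root->child[i])
            root->child[i] = trie_new();
        root = root->child[i];
    }
    root->is_word = 1;
}

int trie_contains(const struct trie_node *root, const char *word)
{
    for (; *word; word++) {
        int i = tolower((unsigned char)*word) - 'a';
        if (i < 0 || i > 25 || !root->child[i])
            return 0;
        root = root->child[i];
    }
    return root->is_word;
}
```

Lookup cost is proportional to the word length, independent of dictionary size, which is what makes this attractive for the "spell check a block of text" part of the question.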
+10  A: 

There's a very interesting Python example and explanation here:

http://norvig.com/spell-correct.html

Martin Cote
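Norvig's Python example generates every candidate string at edit distance 1 from the misspelled word and keeps the ones that appear in the dictionary. As an illustration of one piece of that idea in C, here is just the single-character-deletion subset of candidates (the helper name `deletions` and the fixed buffer size are assumptions for this sketch):

```c
#include <string.h>

#define MAX_WORD 64  /* assumed upper bound on word length */

/* Writes all single-character deletions of `word` into `out`
 * (caller supplies at least strlen(word) rows) and returns the count.
 * e.g. "cat" -> "at", "ct", "ca". */
size_t deletions(const char *word, char out[][MAX_WORD])
{
    size_t n = strlen(word);
    for (size_t i = 0; i < n; i++) {
        memcpy(out[i], word, i);                 /* characters before i */
        memcpy(out[i] + i, word + i + 1, n - i); /* rest, incl. the NUL */
    }
    return n;
}
```

A full port would also generate transpositions, substitutions, and insertions, then test each candidate against the dictionary index.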
A: 

One should look at prefixes and suffixes.

suddenly = sudden + ly.

By removing the "ly", you can get away with storing just the root word.

Likewise, preallocate = pre + allocate.

And lovingly = love + ing + ly gets a bit more complex, as the English rules for "ing" get invoked.

There is also the possibility of using some sort of hashing function to map a root word to a specific bit in a large bit map, as a constant-time method of determining if the root word is spelled correctly.

You can get even more complex by trying to provide an alternate list of possible correct spellings to a misspelled word. You might research the soundex algorithm to get some ideas.

I would advise prototyping with a small set of words. Do a lot of testing, then scale up. It is a wonderful educational problem.

EvilTeach
I can't agree with this. Algorithmically stemming English words is notoriously inaccurate. It might have been worth the space and time saving back in the dark ages, but not any more.
Daniel Cassidy
"English doesn't borrow from other languages, it follows them into dark alleys, knocks them over, and goes through their pockets for loose grammar."
Ben Blank
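The bit-map idea from the answer above amounts to a one-hash Bloom filter. A minimal C sketch (FNV-1a is used here purely as an example hash, and the function names are mine): note that, like any Bloom filter, a set bit can be a false positive, so this can only say "probably correct", never "definitely misspelled" for a word it was given.

```c
#include <stdint.h>

#define BITMAP_BITS (1u << 20)           /* ~1M bits = 128 KB */

static uint8_t bitmap[BITMAP_BITS / 8];

/* FNV-1a hash, reduced to a bit index in the map. */
static uint32_t hash_word(const char *w)
{
    uint32_t h = 2166136261u;
    while (*w) {
        h ^= (uint8_t)*w++;
        h *= 16777619u;
    }
    return h % BITMAP_BITS;
}

void bitmap_add(const char *word)
{
    uint32_t h = hash_word(word);
    bitmap[h / 8] |= (uint8_t)(1u << (h % 8));
}

/* Constant-time membership test; may return 1 for a word never added. */
int bitmap_maybe_contains(const char *word)
{
    uint32_t h = hash_word(word);
    return (bitmap[h / 8] >> (h % 8)) & 1;
}
```

Using several independent hash functions instead of one lowers the false-positive rate for a given map size.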
A: 

Splitting a word into root and suffix is known as the "Porter Stemming Algorithm". It's a good way of fitting an English dictionary into an amazingly small amount of memory.
It's also useful for search, so "spell checker" will also find "spelling check" and "spell checking".

Martin Beckett
A: 

I've done this in class.

You should consider Python's Natural Language Toolkit (NLTK), which is made specifically to handle this.

It also lets you create text interpreters such as chatbots.

Eric
A: 

The OpenOffice spell checker Hunspell can be a good starting point. Here is the homepage: Hunspell at SourceForge

Thomas Maierhofer