I'm sure more than a few of you will have seen the Google Wave demonstration. I was wondering specifically about the spell-checking technology. How revolutionary is a spell checker that makes suggestions by figuring out where a word appears contextually within a sentence?

I haven't seen this technique before. Are there examples of it elsewhere, and if so, are there code examples and literature on how it works?

+8  A: 

My 2 cents: given that translate.google.com is a statistical machine translation engine, and given "The Unreasonable Effectiveness of Data" by A. Halevy, P. Norvig (Director of Research at Google) and F. Pereira, my bet is that this is a statistically driven spell checker.

How it could work: you collect a very large corpus of the language you want to spell check. You store this corpus as phrase-tables in suitable data structures (suffix arrays, for example, if you need to count n-gram subsets) that keep track of the count of each n-gram (and so an estimate of its probability).

For example, if your corpus consists only of:

I had bean soup last dinner.

From this entry, you will generate the following bi-grams (sets of 2 words):

I had, had bean, bean soup, soup last, last dinner

and the tri-grams (sets of 3 words):

I had bean, had bean soup, bean soup last, soup last dinner
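
To make this concrete, here is a minimal Python sketch (my own illustration, not anything Google has published) that builds the bi-gram and tri-gram counts from the toy corpus above:

    from collections import Counter

    corpus = "I had bean soup last dinner"
    tokens = corpus.split()

    def ngrams(tokens, n):
        """Yield every run of n consecutive tokens."""
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

    bigram_counts = Counter(ngrams(tokens, 2))
    trigram_counts = Counter(ngrams(tokens, 3))

    print(list(bigram_counts))   # ('I', 'had'), ('had', 'bean'), ..., ('last', 'dinner')
    print(list(trigram_counts))  # ('I', 'had', 'bean'), ..., ('soup', 'last', 'dinner')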

But these will be pruned by tests of statistical relevance. For example, we can expect that the tri-gram

I had bean

will disappear from the phrase-table.
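
A pruning pass could look something like this sketch; the counts and the cutoff are invented for illustration, and a real system would presumably use a proper statistical test (a log-likelihood ratio, say) rather than a raw count threshold:

    # Toy tri-gram counts: in a large corpus "I had bean" would be rare
    # (a typo for "been"), while "had bean soup" is a common phrase.
    trigram_counts = {
        ("I", "had", "bean"): 1,
        ("had", "bean", "soup"): 120,
    }

    MIN_COUNT = 5  # arbitrary relevance threshold, purely for illustration

    phrase_table = {ng: c for ng, c in trigram_counts.items() if c >= MIN_COUNT}
    print(phrase_table)  # {('had', 'bean', 'soup'): 120}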

Now, spell checking only has to look in these big phrase-tables and check the "probabilities". (You need good infrastructure to store these phrase-tables in an efficient data structure and in RAM; Google has it for translate.google.com, so why not for this? It's easier than statistical machine translation.)

Ex: you type

I had been soup

and in the phrase-table there is a

had bean soup

tri-gram with a much higher probability than what you just typed! Indeed, you only need to change one word (this is a "not so distant" tri-gram) to get a tri-gram with a much higher probability. There would have to be an evaluating function dealing with the trade-off between distance and probability. That distance could even be computed in terms of characters: we are doing spell checking, not machine translation.
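
Here is a hypothetical sketch of such an evaluating function; the phrase-table, its counts, and the weight are all invented, and a production checker would use a real edit distance and smoothed probabilities:

    from difflib import SequenceMatcher

    # Invented phrase-table: counts of tri-grams seen in the corpus.
    phrase_table = {
        ("had", "been", "soup"): 1,
        ("had", "bean", "soup"): 120,
    }
    total = sum(phrase_table.values())

    def char_distance(a, b):
        """Character-level dissimilarity of two tri-grams, in [0, 1]."""
        return 1.0 - SequenceMatcher(None, " ".join(a), " ".join(b)).ratio()

    def score(candidate, typed, weight=0.5):
        """Trade off tri-gram probability against distance from the input."""
        prob = phrase_table.get(candidate, 0) / total
        return prob - weight * char_distance(candidate, typed)

    typed = ("had", "been", "soup")
    best = max(phrase_table, key=lambda c: score(c, typed))
    print(best)  # ('had', 'bean', 'soup'): suggest replacing "been" with "bean"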

This is only my hypothetical opinion. ;)

SnippyHolloW
+1  A: 

You should also watch an official video by Casey Whitelaw of the Google Wave team that describes the techniques used: http://www.youtube.com/watch?v=Sx3Fpw0XCXk

guitardave
+1  A: 

You can learn all about topics like this by diving into natural language processing. You can even go as deep as making a statistical guess about which word will come next after a given string of words (see the sketch below).

If you are interested in this topic, I highly suggest the NLTK (Natural Language Toolkit), written entirely in Python. It is a very extensive toolkit, with many tools and pretty good documentation.
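
For instance, a bi-gram next-word predictor takes only a few lines with NLTK (this sketch assumes the Brown corpus has been fetched with nltk.download('brown')):

    import nltk
    from nltk.corpus import brown

    # For each word, count which words follow it in the Brown corpus.
    cfd = nltk.ConditionalFreqDist(nltk.bigrams(brown.words()))

    # The five most likely words to follow "the":
    print(cfd["the"].most_common(5))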

San Jacinto
+1  A: 

There are a lot of papers on this subject. Here are some good resources.

This doesn't use context sensitivity, but it's a good base to build from: http://norvig.com/spell-correct.html

This is probably a good and easy-to-understand overview of a more powerful spell checker: http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Cucerzan.pdf

From here you can dive deep into the particulars. I'd recommend using Google Scholar, looking up the references in the paper above, and searching for "spelling correction".

jshen