views: 140

answers: 5
What books are there on how to build a natural language parsing program that does something like this:

input: I got to TALL you
output: I got to TELL you

input: Big RAT box
output: Big RED box

input: hoo un thum zend three
output: one thousand three

It must have a language model that makes it possible to predict which words are misspelled!

What are the best books on how to build such a tool?

P.S. Are there free web services for spell checking? From Google, maybe?

+1  A: 

Soundex (wiki) is one option.
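
A minimal sketch of classic American Soundex in Python (the function name and the digit mapping below are just the textbook rules, not taken from any particular library):

    def soundex(word):
        """Classic American Soundex: first letter plus three digits."""
        codes = {
            "bfpv": "1", "cgjkqsxz": "2", "dt": "3",
            "l": "4", "mn": "5", "r": "6",
        }

        def code(ch):
            for letters, digit in codes.items():
                if ch in letters:
                    return digit
            return ""  # vowels, h, w, y get no digit

        word = word.lower()
        if not word:
            return ""
        result = word[0].upper()
        prev = code(word[0])
        for ch in word[1:]:
            digit = code(ch)
            if digit and digit != prev:
                result += digit
            if ch not in "hw":  # h and w do not break a run of equal codes
                prev = digit
        return (result + "000")[:4]

    # Words that sound alike map to the same code, so it can be used to
    # look up replacement candidates: soundex("Robert") == soundex("Rupert") == "R163"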

PeanutPower
As George Bernard Shaw (amongst many others) always complained, there is often a great divergence between how things are spelled and how they are pronounced, at least in English. SOUNDEX() might be an effective approach in, say, Italian.
APC
This one is built into the Delphi RTL. It's pretty unpredictable, but fairly cool - good for people who like to write fenetiklee... err, phonetically.
Peter Turner
+4  A: 

Peter Norvig has written a terrific spell checker. Maybe that can help you.
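
For reference, the core of that approach is to generate every candidate within one or two edits of the input word and rank candidates by their frequency in a corpus of correct text. A condensed sketch along those lines (not Norvig's exact script; "big.txt" stands in for whatever training corpus you use, as in his write-up):

    import re
    from collections import Counter

    # Word frequencies from a large corpus of correct text.
    WORDS = Counter(re.findall(r"[a-z]+", open("big.txt").read().lower()))

    def edits1(word):
        """All strings one edit (delete, transpose, replace, insert) away."""
        letters = "abcdefghijklmnopqrstuvwxyz"
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [L + R[1:] for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
        inserts = [L + c + R for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def known(words):
        return {w for w in words if w in WORDS}

    def correct(word):
        """Prefer the word itself, then known words 1 edit away, then 2 edits."""
        edits2 = {e2 for e1 in edits1(word) for e2 in edits1(e1)}
        candidates = known([word]) or known(edits1(word)) or known(edits2) or [word]
        return max(candidates, key=WORDS.get)

    # e.g. correct("speling") should come back as "spelling"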

duffymo
I was just about to link it :-) +1
Pavel Shved
Cool script. Seems like it would be straightforward to extend it to word bigrams or trigrams if you had a corpus of correct text in the language of choice.
Jamey Hicks
Exactly, that's the script that I tried to remember in my post below. +1
Bruno Rothgiesser
+1  A: 

At Dev Days London, Michael Sparks presented a Python script written for exactly this. It was surprisingly simple! See if you can find it on Google. Maybe somebody here will have the link.

Bruno Rothgiesser
According to the DevDays thread on MetaSO, the script Michael Sparks presented was the Peter Norvig script already mentioned: http://meta.stackoverflow.com/questions/27859/devdays-london-can-i-get-hold-of-the-presentation-material/28522#28522
APC
Yes, that's correct, thanks
Bruno Rothgiesser
+2  A: 

You have at least three options:

  1. You can write a program which actually understands the language (i.e. what a word means). This is still an open research topic. Expect the first results when you can buy a computer fast enough to run such a program (probably in about 10 years, when computers have become 1000 times faster than today).

  2. Use a huge corpus (text documents) to train a Hidden Markov Model.

  3. Use a huge corpus and generate statistics about n-grams, e.g. quadruplets (4-grams), i.e. how often a tuple of N words appears. I don't have a link handy for this, but the idea is that certain words tend to appear only in the context of certain other words. So when you split your text into 4-grams, look them up in your database, and can't find one of them, chances are that something is wrong with that tuple. The next step is to find all possible matches (other 4-grams which have a small Soundex or similar distance to the current one) and try the one with the highest frequency. A rough sketch of this lookup follows the list.

    Google has this data for quite a few languages, and you might find more about it in Google Labs.
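
Here is a rough sketch of what option 3 might look like in Python, assuming a plain-text training corpus (corpus.txt is a placeholder) and some candidates_for() helper that proposes similar-sounding words, e.g. by Soundex code or edit distance:

    from collections import Counter

    def ngrams(tokens, n=4):
        return zip(*(tokens[i:] for i in range(n)))

    # Train: count every 4-gram in a corpus of correct text.
    corpus_tokens = open("corpus.txt").read().lower().split()
    counts = Counter(ngrams(corpus_tokens, 4))

    def check(sentence, candidates_for):
        """Flag 4-grams never seen in the corpus and propose the most frequent fix.

        candidates_for(word) should return plausible replacements, e.g. words
        with the same Soundex code or within a small edit distance.
        """
        tokens = sentence.lower().split()
        for gram in ngrams(tokens, 4):
            if counts[gram] > 0:
                continue  # tuple looks fine
            # Try swapping each position for a candidate; keep the best-scoring tuple.
            best, best_count = gram, 0
            for i, word in enumerate(gram):
                for cand in candidates_for(word):
                    trial = gram[:i] + (cand,) + gram[i + 1:]
                    if counts[trial] > best_count:
                        best, best_count = trial, counts[trial]
            if best_count > 0:
                print(f"suspicious: {' '.join(gram)}  ->  {' '.join(best)}")

In practice you would smooth the counts and back off to trigrams or bigrams, because unseen 4-grams are common even in huge corpora and a zero count is not by itself proof of a misspelling.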

[EDIT] After some googling, I finally found the link: on that page you can buy the English 1- to 5-grams which Google collected over the whole Internet, distributed on 6 DVDs.

Googling for "google spelling statistics n-grams" will also turn up some interesting links.

Aaron Digulla
Will Google share this data with me? ;)
EugeneP
I think so. I really must find the link again.
Aaron Digulla
@Aaron Digulla Thank you for a thorough and interesting answer.
EugeneP
+1  A: 

There are quite a few Java libraries for natural language processing that would help you implement a spelling corrector. But you asked about a book. Foundations of Statistical Natural Language Processing by Christopher D. Manning and Hinrich Schütze looks like a good option. The first author is a Stanford professor who leads a group that does natural language processing research and develops Java libraries and NLP resources that many people use.

Jamey Hicks