ansaurus

Question

Machine learning for typos.

Answer 1

+3 A:

maybe these links can help

http://answers.google.com/answers/threadview?id=526503

http://googlesystem.blogspot.com/2007/04/simplified-version-of-googles-spell.html

http://en.wikipedia.org/wiki/Spell_checker

動靜能量 2009-06-04 10:32:44

Answer 2

+1 A:

just having frequency count of words is enough, I dont think you need somehthing complicated for this, not even machine learning. No need of learn a model. If you have entered something strange but not a typo you will notice that they try to "correct" it also.

Jordi 2009-06-04 10:32:49

Answer 3

+2 A:

Peter Norvig (Director of Research at Google) has written a little introductory piece on spell checking in Python using statistical heuristics.

It's a great read, and shows how to use statistical heuristics in a very simple way. It could be ported to C# (with LINQ) quite easily (Python's list comprehensions are very close to Linq expressions).

The core part of this snippet is having all simple typos for a word (edit1 function) the C# equivalent is the following

public static IEnumerable<string> Edit1(string word){
const string alphabet = "abcdefghijklmnopqrstuvwxyz";
var s = from i in Enumerable.Range (0, word.Length - 1)
        select new Pair<string>(word.Substring (0, i), word.RSlice(i));

var deletes = from p in s 
       select p.First + p.Second.RSlice (1);

var transposes = from p in s 
   let b1 = p.Second 
   where b1.Length > 2 
   select p.First + b1 [1] + b1 [0] + b1.RSlice (2);

var replaces = from p in s 
        let b = p.Second 
        where b.Length > 0 
        from c in alphabet select p.First + c + b.RSlice (1);

var inserts = from p in s 
       from c in alphabet 
       select p.First + c + p.Second;

return deletes.Concat (transposes).Concat( replaces)
     .Concat(inserts).Distinct ();}

where Pair is a poor man tuple (obvious code not included) and RSlice is a poor-man string-only right splicing :

public static class Extensions {
 public static string RSlice (this string input, int i)
 {
  if (i > input.Length - 1)
   return "";
  return input.Substring (i);
 }}

Once you got the edits for a word, you look for the word in the dictionary or the existing words of the edits (selecting the most frequent word) or the words for edits1(edits1(word)) (selecting the most frequent word). Surprisingly, this can be quite fast and quite accurate. I'll have a link to my blog for the whole stuff ported.

Edit: oops just saw that a link in an answer above pointed to a pointer to the same Norvig's piece...

Yann Schwartz 2009-06-04 11:10:30

Are you really gonna post a C# version? heee man I just wanted to know,but if you are an ace hey well what can I say

abmv 2009-06-04 14:55:49

I just finished a raw port. It's a bit ugly and clocks at ~100 lines instead of the ~20 original one (hey C#4, what about forgetting nice COM interop and having slices in lists?)

Yann Schwartz 2009-06-04 15:00:49

+1 for C# translation. Very nice and useful for "the rest of us".

Robert Koritnik 2009-12-29 08:59:38

ansaurus

tags:

views:

answers:

Machine learning for typos.

related questions