views: 281
answers: 4

Where can I find some real world typo statistics?

I'm trying to match people's input text to internal objects, and people tend to make spelling mistakes.
There are 2 kinds of mistakes:

  1. Typos - "Helllo" instead of "Hello", "Satudray" instead of "Saturday", etc.
  2. Spelling - "Shikago" instead of "Chicago"

I use Damerau-Levenshtein distance for the typos and Double Metaphone for spelling (Python implementations here and here).

I want to focus on the Damerau-Levenshtein distance (or simply edit distance). The textbook implementations always use a weight of 1 for deletions, insertions, substitutions and transpositions. While this is simple and allows for nice algorithms, it doesn't match "reality" / real-world probabilities.

Examples:

  • I'm sure the likelihood of typing "Helllo" when meaning "Hello" is greater than that of "Helzlo", yet they are both 1 edit distance away.
  • "Gello" is closer than "Qello" to "Hello" on a QWERTY keyboard.
  • Unicode transliterations: what is the "real" distance between "München" and "Munchen"? (One way to fold these together is sketched right after this list.)
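
For the transliteration bullet, a common trick is to fold diacritics away with Unicode normalization before measuring distance, so that "München" and "Munchen" compare as equal. This is a sketch of one possible treatment, not a claim about the "real" distance:

    import unicodedata

    def strip_diacritics(s):
        # NFKD decomposes 'ü' into 'u' plus a combining diaeresis;
        # dropping the combining marks leaves only the base letters.
        return "".join(c for c in unicodedata.normalize("NFKD", s)
                       if not unicodedata.combining(c))

    print(strip_diacritics("München"))  # "Munchen" -> edit distance 0 after folding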

What should the "real world" weights be for deletions, insertions, substitutions and transpositions?

Even Norvig's very cool spell corrector uses non-weighted edit distance.
BTW- I'm sure the weights need to be functions and not simple floats (per the above examples)...
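
To make the "weights as functions" idea concrete, here is a minimal Python sketch of a Damerau-Levenshtein variant (optimal string alignment) whose per-operation costs are functions of the characters involved. The QWERTY_NEIGHBORS table is deliberately partial and the 0.5 / 0.7 / 1.0 costs are illustrative guesses, not learned values:

    # Illustrative, partial QWERTY adjacency table -- just enough for the examples.
    QWERTY_NEIGHBORS = {
        "h": "gjybnu", "g": "fhtyvb", "e": "wsdr", "r": "edft",
    }

    def sub_cost(a, b):
        """Substituting an adjacent key is judged closer than a distant one."""
        a, b = a.lower(), b.lower()
        if a == b:
            return 0.0
        return 0.5 if b in QWERTY_NEIGHBORS.get(a, "") else 1.0

    def ins_cost(prev, c):
        """Doubling the previous letter ("Helllo") is a cheap, common slip."""
        return 0.5 if prev.lower() == c.lower() else 1.0

    def del_cost(c):
        return 1.0

    def trans_cost(a, b):
        return 0.7  # swaps ("Satudray") judged cheaper than two full edits

    def weighted_distance(s, t):
        m, n = len(s), len(t)
        d = [[0.0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            d[i][0] = d[i - 1][0] + del_cost(s[i - 1])
        for j in range(1, n + 1):
            d[0][j] = d[0][j - 1] + ins_cost(t[j - 2] if j > 1 else "", t[j - 1])
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                d[i][j] = min(
                    d[i - 1][j] + del_cost(s[i - 1]),                             # deletion
                    d[i][j - 1] + ins_cost(t[j - 2] if j > 1 else "", t[j - 1]),  # insertion
                    d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]),               # substitution
                )
                if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                    d[i][j] = min(d[i][j], d[i - 2][j - 2]
                                  + trans_cost(s[i - 2], s[i - 1]))               # transposition
        return d[m][n]

    print(weighted_distance("Hello", "Gello"))   # 0.5: G is next to H
    print(weighted_distance("Hello", "Qello"))   # 1.0: Q is far from H
    print(weighted_distance("Hello", "Helllo"))  # 0.5: doubled letter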

I can adjust the algorithm, but where can I "learn" these weights?

I don't have access to Google-scale data...

Should I just guess them?

EDIT - trying to answer user questions:

  • My current non-weighted algorithm fails often when faced with typos for the above reasons. "Return on Tursday": every "real person" can easily tell Thursday is more likely than Tuesday, yet they are both 1-edit-distance away! (Yes, I do log and measure my performance).
  • I'm developing an NLP Travel Search engine, so my dictionary contains ~25K destinations (expected to grow to 100K), Time Expressions ~200 (expected 1K), People expressions ~100 (expected 300), Money Expressions ~100 (expected 500), "glue logic words" ("from", "beautiful", "apartment") ~2K (expected 10K) and so on...
  • Usage of the edit distance is different for each of the above word groups. I try to "auto-correct when obvious", e.g. when the input is 1 edit distance away from only 1 other word in the dictionary (see the sketch right after this list). I have many other hand-tuned rules, e.g. a Double Metaphone fix which is not more than 2 edit distance away from a dictionary word with a length > 4... The list of rules continues to grow as I learn from real world input.
  • "How many pairs of dictionary entries are within your threshold?": well, that depends on the "fancy weighing system" and on real world (future) input, doesn't it? Anyway, I have extensive unit tests so that every change I make to the system only makes it better (based on past inputs, of course). Most sub-6 letter words are within 1 edit distance from a word that is 1 edit distance away from another dictionary entry.
  • Today, when there are 2 dictionary entries at the same distance from the input, I try to apply various statistics to better guess which one the user meant (e.g. Paris, France is more likely to show up in my search than Pārīz, Iran).
  • The cost of choosing a wrong word is returning semi-random (often ridiculous) results to the end-user and potentially losing a customer. The cost of not understanding is slightly less expensive: the user will be asked to rephrase.
  • Is the cost of complexity worth it? Yes, I'm sure it is. You would not believe the amount of typos people throw at the system and expect it to understand, and I could sure use the boost in Precision and Recall.
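
"Auto-correct when obvious" from the list above can be stated in a few lines. This is only a sketch of the rule as described: correct_if_obvious and the example dictionary are hypothetical names, and edit_distance stands in for whatever (weighted) implementation is actually in use:

    def correct_if_obvious(token, dictionary, edit_distance, max_dist=1):
        """Auto-correct only when exactly one dictionary entry is close enough."""
        candidates = [w for w in dictionary if edit_distance(token, w) <= max_dist]
        if len(candidates) == 1:
            return candidates[0]  # unambiguous: safe to correct silently
        return None               # zero or several matches: ask the user instead

    # e.g. with days = {"Tuesday", "Thursday"} and a plain Levenshtein distance:
    # correct_if_obvious("Thursdey", days, levenshtein) -> "Thursday"
    # correct_if_obvious("Tursday", days, levenshtein)  -> None (both are 1 away)
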
+2  A: 

I would advise you to check out the trigram algorithm. In my opinion it works better for finding typos than an edit distance algorithm. It should work faster as well, and if you keep the dictionary in a Postgres database you can make use of an index.
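
For illustration, here is a rough Python sketch of trigram similarity, loosely in the spirit of what PostgreSQL's pg_trgm extension computes (the padding scheme and the set overlap below are my simplifications, not the exact pg_trgm definition):

    def trigrams(word):
        padded = "  " + word.lower() + " "  # pad so word boundaries count too
        return {padded[i:i + 3] for i in range(len(padded) - 2)}

    def trigram_similarity(a, b):
        ta, tb = trigrams(a), trigrams(b)
        return len(ta & tb) / len(ta | tb)  # overlap of the two trigram sets

    print(trigram_similarity("Thursday", "Tursday"))  # ~0.55
    print(trigram_similarity("Tuesday", "Tursday"))   # ~0.45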

You may also find the Stack Overflow topic about Google's "Did you mean" useful.

jethro
A: 

Some questions for you, to help you determine whether you should be asking your "where do I find real-world weights" question:

Have you actually measured the effectiveness of the uniform weighting implementation? How?

How many different "internal objects" do you have -- i.e. what is the size of your dictionary?

How are you actually using the edit distance e.g. John/Joan, Marmaduke/Marmeduke, Featherstonehaugh/Featherstonhaugh: is that "all 1 error" or is it 25% / 11.1% / 5.9% difference? What threshold are you using?

How many pairs of dictionary entries are within your threshold (e.g. John vs Joan, Joan vs Juan, etc)? If you introduced a fancy weighting system, how many pairs of dictionary entries would migrate (a) from inside the threshold to outside (b) vice versa?

What do you do if both John and Juan are in your dictionary and the user types Joan?

What are the penalties/costs of (1) choosing the wrong dictionary word (not the one that the user meant) (2) failing to recognise the user's input?

Will introducing a complicated weighting system actually reduce the probabilities of the above two error types by sufficient margin to make the complication and slower speed worthwhile?

BTW, how do you know what keyboard the user was using?

Update:

"""My current non-weighted algorithm fails often when faced with typos for the above reasons. "Return on Tursday": every "real person" can easily tell Thursday is more likely than Tuesday, yet they are both 1-edit-distance away! (Yes, I do log and measure my performance)."""

Yes, Thursday -> Tursday by omitting an "h", but Tuesday -> Tursday by substituting "r" instead of "e". E and R are next to each other on qwERty and azERty keyboards. Every "real person" can easily guess that Thursday is more likely than Tuesday. Even if statistics as well as guesses point to Thursday being more likely than Tuesday (perhaps omitting h will cost 0.5 and e->r will cost 0.75), will the difference (perhaps 0.25) be significant enough to always pick Thursday? Can/will your system ask "Did you mean Tuesday?" or does/will it just plough ahead with Thursday?

John Machin
Good questions. I've left some of the answers out on purpose, to make the question a little more generic and of interest to other users... Anyway, I'll edit the question to try and answer them.
Tal Weiss
I don't know which keyboard the user used, but I'm positive that QWERTY variants are more common than, say, Dvorak.
Tal Weiss
What about AZERTY keyboards?
John Machin
+1  A: 

If research is your interest, I think continuing with that algorithm and trying to find decent weights would be fruitful.

I can't help you with typo stats, but I think you should also play with Python's difflib, specifically the ratio() method of SequenceMatcher. It uses an algorithm which the docs (http://docs.python.org/library/difflib.html) claim is well suited to matches that 'look right', and it may be useful to augment or test what you're doing.

For Python programmers just looking for typos it is a good place to start. One of my coworkers has used both Levenshtein edit distance and SequenceMatcher's ratio(), and got much better results from the latter.
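
For example (the exact ratios are approximate; the ordering is the point):

    from difflib import SequenceMatcher

    # ratio() returns a float in [0, 1]; higher means more similar.
    print(SequenceMatcher(None, "Thursday", "Tursday").ratio())  # ~0.93
    print(SequenceMatcher(None, "Tuesday", "Tursday").ratio())   # ~0.86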

JivanAmara
+2  A: 

A possible source of real-world typo statistics would be Wikipedia's complete edit history.

http://download.wikimedia.org/

Also, you might be interested in AWB's RegExTypoFix:

http://en.wikipedia.org/wiki/Wikipedia:AWB/T
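
If you do mine (typo, correction) pairs out of such sources, turning them into substitution weights could look roughly like the sketch below. Everything in it is hypothetical: the three pairs are just the question's examples, and the -log(probability) costs are one common choice among many:

    import math
    from collections import Counter

    # Hypothetical (typo, correction) pairs, e.g. mined from Wikipedia diffs
    # or from AWB's RegExTypoFix rule list.
    pairs = [("Shikago", "Chicago"), ("Tursday", "Thursday"), ("Helllo", "Hello")]

    sub_counts = Counter()   # how often each (typed, intended) substitution occurs
    char_counts = Counter()  # how often each intended character occurs at all

    for wrong, right in pairs:
        if len(wrong) != len(right):
            continue  # this toy version learns substitutions from same-length pairs only
        for typed, intended in zip(wrong.lower(), right.lower()):
            char_counts[intended] += 1
            if typed != intended:
                sub_counts[(typed, intended)] += 1

    def sub_cost(typed, intended, floor=1e-6):
        """Cheaper substitution cost for confusions seen often in the data."""
        if typed == intended:
            return 0.0
        p = sub_counts[(typed, intended)] / max(char_counts[intended], 1)
        return -math.log(max(p, floor))

    print(sub_cost("k", "c"))  # learned from "Shikago" -> "Chicago": relatively cheap
    print(sub_cost("z", "c"))  # never observed: expensive (hits the floor)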

tszming
+1 Very very interesting! I will certainly look into this one!
Tal Weiss