Suppose you already have a search system on your website. How can you implement a "Did you mean:" feature like the one Google shows for some search queries?

+4  A: 

I would suggest looking at SOUNDEX to find similar words in your database.
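
For a feel of what SOUNDEX computes, here is a minimal Python sketch of the classic American Soundex rules. The function name `soundex` and the sample names are my own illustration; databases that offer SOUNDEX have it built in, and details can vary slightly between implementations.

```python
def soundex(word: str) -> str:
    """Return the 4-character American Soundex code for an ASCII word."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit

    # Keep only the letters A-Z; everything else is ignored.
    letters_only = [ch for ch in word.upper() if "A" <= ch <= "Z"]
    if not letters_only:
        return ""

    first = letters_only[0]
    result = first
    prev = codes.get(first, "")
    for ch in letters_only[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = codes.get(ch, "")  # vowels and Y have no code and reset prev
        if code and code != prev:
            result += code
        prev = code
    # Pad with zeros and truncate to one letter plus three digits.
    return (result + "000")[:4]


print(soundex("Smith"), soundex("Schmidt"))  # both print S530
```

Words that share a code are treated as sounding alike, so you would store the code next to each indexed term and look up the code of the query term.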

You can also access Google's own dictionary by using the Google API spelling suggestion request.

Espo
+1 for the link to the Google API which seems to be exactly what the asker was looking for, even if the chosen answer is more in depth and answers the 'why' and 'how' of Google's implementation.
dimo414
A: 

Soundex is good for phonetic matches, but works best with people's names (it was originally developed for census data).

Also check out full-text indexing. The syntax is different from Google's query logic, but it's very quick and can deal with similar language elements.

Keith
One of the bad things about Soundex is that it's too English-centric.
Javier
It was developed to Anglicize names, so Smith and Schmidt are supposed to match in it. Metaphone is better but has a similar problem. Any phonetic algorithm is going to be language-dependent.
Keith
A: 

Soundex and "Porter stemming" (Soundex is trivial to implement; I'm not sure about Porter stemming).

Michael Neale
Information on Porter stemming (including implementations in 19 different programming languages) can be found at http://tartarus.org/~martin/PorterStemmer/index.html
msanders
+11  A: 

Check the Wikipedia article on the Levenshtein distance. Make sure you take a good look at its "Possible improvements" section.
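
As a rough illustration, the textbook dynamic-programming version of Levenshtein distance fits in a few lines of Python. This is a minimal sketch that keeps only two rows of the table; the function name `levenshtein` is my own, and it does not include any of the improvements the article lists.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Note that a swap of two adjacent letters ("haert" vs "heart") costs 2 here, since plain Levenshtein has no transposition operation.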

+52  A: 

What Google does is actually far from trivial and, at first, counter-intuitive. They don't check against a dictionary; instead they use statistics to identify "similar" queries that returned more results than your query. The exact algorithm is, of course, not known.

There are different sub-problems to solve here. As a fundamental basis for everything related to statistical Natural Language Processing, there is one must-have book: Foundations of Statistical Natural Language Processing.

Concretely, to solve the problem of word/query similarity, I have had good results with edit distance, a mathematical measure of string similarity that works surprisingly well. I used Levenshtein distance, but the other variants may be worth looking into.

Soundex - in my experience - is crap.

Efficiently storing and searching a large dictionary of misspelled words with sub-second retrieval is, again, non-trivial. Your best bet is to use an existing full-text indexing and retrieval engine (i.e. not your database's built-in one). Lucene is currently one of the best and, conveniently, has been ported to many platforms.

Boris Terzic
+2  A: 

If you have industry-specific terminology, you will likely need a thesaurus. For example, I worked in the jewelry industry, and our descriptions contained abbreviations such as kt (karat), rd (round), and cwt (carat weight). Endeca (the search engine at that job) has a thesaurus that will translate common misspellings, but it does require manual intervention.

Greg Ogle
+3  A: 

I think this depends on how big your website is. On our local intranet, which is used by about 500 members of staff, I simply look at the search phrases that returned zero results and enter each one, together with a suggested replacement phrase, into a SQL table.

I then query that table whenever a search returns no results. However, this only works if the site is relatively small, and I only do it for the most common search phrases.
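
A minimal sketch of that lookup table, using Python's built-in `sqlite3` so it runs anywhere. The table name `search_suggestion`, its columns, and the helper function names are hypothetical, not the schema from my intranet.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the suggestion table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS search_suggestion ("
        "phrase TEXT PRIMARY KEY, suggestion TEXT NOT NULL)"
    )
    return conn


def add_suggestion(conn, phrase, suggestion):
    """Manually map a zero-result phrase to a phrase that does return results."""
    conn.execute(
        "INSERT OR REPLACE INTO search_suggestion (phrase, suggestion) "
        "VALUES (?, ?)",
        (phrase.strip().lower(), suggestion),
    )
    conn.commit()


def did_you_mean(conn, phrase):
    """Call this only when the real search returned nothing."""
    row = conn.execute(
        "SELECT suggestion FROM search_suggestion WHERE phrase = ?",
        (phrase.strip().lower(),),
    ).fetchone()
    return row[0] if row else None


conn = open_store()
add_suggestion(conn, "holliday form", "holiday form")
print(did_you_mean(conn, "Holliday Form"))  # holiday form
```

Phrases are normalized to lower case on both write and read, so the lookup is case-insensitive.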

You might also want to look at my answer to a similar question:

GateKiller
+4  A: 

You may want to look at Peter Norvig's "How to Write a Spelling Corrector" article.

FA
perfect, thanks!
David McLaughlin
+4  A: 

I believe Google logs all queries and identifies when someone makes a spelling correction. This correction may then be suggested when others supply the same first query. This works for any language, and in fact for any string of characters.
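
Here is a rough sketch of how that idea could look on your own logs. Everything specific is an assumption on my part: the log format `(user_id, timestamp_seconds, query)`, the 30-second window, and the use of `difflib` similarity to decide that a follow-up query is a correction rather than a new search. It says nothing about what Google actually runs.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def mine_corrections(log, max_gap=30.0, min_similarity=0.8):
    """Count 'query A quickly followed by similar query B' per user.

    log: iterable of (user_id, timestamp_seconds, query), in time order.
    Returns {first_query: Counter({follow_up_query: times_seen})}.
    """
    rewrites = defaultdict(Counter)
    last = {}  # user_id -> (timestamp, query) of that user's previous search
    for user, ts, query in log:
        q = query.strip().lower()
        if user in last:
            prev_ts, prev_q = last[user]
            similar = SequenceMatcher(None, prev_q, q).ratio() >= min_similarity
            if q != prev_q and ts - prev_ts <= max_gap and similar:
                rewrites[prev_q][q] += 1
        last[user] = (ts, q)
    return rewrites


def suggest(rewrites, query):
    """Most frequent follow-up seen for this query, or None."""
    counts = rewrites.get(query.strip().lower())
    return counts.most_common(1)[0][0] if counts else None


log = [
    ("u1", 0, "barak obama"), ("u1", 5, "barack obama"),
    ("u2", 0, "barak obama"), ("u2", 4, "barack obama"),
    ("u3", 0, "barak obama"), ("u3", 3, "cats"),
]
print(suggest(mine_corrections(log), "barak obama"))  # barack obama
```

In practice you would also require a minimum count before showing a suggestion, so that a single user's odd rewrite does not become a "Did you mean".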

Liam
They do indeed. This helps them learn new words easily -- they have the help of millions.
A. Rex
+7  A: 

Google's "Did you mean" is definitely simple spell checking! Google's Peter Norvig even gives a 20-ish-line Python implementation:

http://googlesystem.blogspot.com/2007/04/simplified-version-of-googles-spell.html

http://www.norvig.com/spell-correct.html
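
To give the flavor of that approach without clicking through, here is a condensed sketch in the same spirit: generate every string one edit away (then two), and pick the known word with the highest count. It is not Norvig's exact code, and `TOY_COUNTS` is made-up data standing in for word counts you would collect from a large text corpus or from your own query logs.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

# Made-up counts for illustration only; use a real corpus in practice.
TOY_COUNTS = Counter({"heart": 50, "will": 40, "hear": 10,
                      "go": 30, "on": 60, "my": 45})


def edits1(word):
    """All strings one delete, transpose, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in ALPHABET]
    inserts = [left + c + right for left, right in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def correct(word, counts):
    """Return the most frequent known word within two edits, else the input."""
    word = word.lower()
    if word in counts:
        return word
    one_edit = edits1(word)
    near = {w for w in one_edit if w in counts}
    if not near:  # nothing one edit away: widen the search to two edits
        near = {w2 for w1 in one_edit for w2 in edits1(w1) if w2 in counts}
    return max(near, key=lambda w: counts[w]) if near else word


print(correct("haert", TOY_COUNTS), correct("wil", TOY_COUNTS))  # heart will
```

Training the counts on previous queries rather than on prose is what moves this from a spell checker toward a query suggester.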

Dr Norvig also discusses the "did you mean" in this excellent talk.

SOUNDEX and other guesses don't get a look in, people!

Will
Dr. Norvig provided a toy example of the concept; it's not nearly accurate enough to provide "did you mean" for the web. For example: "barak" does not produce a suggestion; "barak obama" does (since they know "barack" occurs often with "obama", and can infer the likely correction).
SquareCog
It isn't hard to go from his toy spell checker to something that does handle your example and that works well. An important thing to remember is that he is showing a spell checker, which is subtly but significantly different from a query suggester. Training it with previous queries instead of English text is a good place to start.
jshen
There's definitely more to it than just spell-checking. For one thing, I've seen cases where neither the thing I typed nor the suggested replacement is a "dictionary word".
Kyralessa
@Kyralessa: do you think their dictionary is a static list of words in some language, or dynamic and based on the words on the internet and common search terms? That doesn't mean it isn't still a dictionary check. Dr Norvig is, after all, head of research at Google; when asked how "did you mean" is implemented, his answer is authoritative.
Will
A: 

There's something called aspell that might help: http://blog.evanweaver.com/files/doc/fauna/raspell/classes/Aspell.html

There's a Ruby gem for it, but I don't know how to talk to it from Python: http://blog.evanweaver.com/files/doc/fauna/raspell/files/README.html

Here's a quote from the Ruby implementation's README:

Usage

Aspell lets you check words and suggest corrections. For example:

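  require "raspell"

  # Assumes an en_US Aspell dictionary is installed
  speller = Aspell.new("en_US")
  speller.suggestion_mode = Aspell::NORMAL

  string = "my haert wil go on"

  # Visit each word and print the top suggestion for the misspelled ones
  string.gsub(/[\w\']+/) do |word|
    if !speller.check(word)
      # word is wrong
      puts "Possible correction for #{word}:"
      puts speller.suggest(word).first
    end
  end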

This outputs:

  Possible correction for haert:
  heart
  Possible correction for wil:
  Will

A: 

Implementing spelling correction for search engines in an effective way is not trivial (you can't just compute the edit (Levenshtein) distance to every possible word). A solution based on k-gram indexes is described in Introduction to Information Retrieval (full text available online).
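
A minimal sketch of the k-gram idea, assuming bigrams (k=2), `$` as a word-boundary marker, and Jaccard overlap as the filter. The function names and the 0.4 threshold are my own choices, not taken from the book. The index is used to shortlist candidates cheaply; you would then run edit distance only on that shortlist.

```python
from collections import defaultdict


def kgrams(word, k=2):
    """Set of character k-grams of word, with $ marking both ends."""
    padded = f"${word}$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}


def build_index(vocabulary, k=2):
    """Map each k-gram to the set of vocabulary words containing it."""
    index = defaultdict(set)
    for word in vocabulary:
        for gram in kgrams(word, k):
            index[gram].add(word)
    return index


def candidates(query, index, k=2, min_jaccard=0.4):
    """Vocabulary words whose k-gram sets overlap enough with the query's."""
    q = kgrams(query, k)
    overlap = defaultdict(int)  # word -> number of k-grams shared with query
    for gram in q:
        for word in index.get(gram, ()):
            overlap[word] += 1
    scored = []
    for word, shared in overlap.items():
        jaccard = shared / (len(q) + len(kgrams(word, k)) - shared)
        if jaccard >= min_jaccard:
            scored.append((jaccard, word))
    # Highest overlap first; ties broken alphabetically.
    return [w for _, w in sorted(scored, key=lambda s: (-s[0], s[1]))]


index = build_index(["configure", "configuration", "ubuntu", "figure"])
print(candidates("configre", index))  # ['configure']
```

Only words that share at least one k-gram with the query are ever touched, which is what avoids comparing against the whole vocabulary.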

Fabian Steeg
A: 

I do it with Lucene's Spell Checker.

cherouvim
+5  A: 

I was pleasantly surprised that someone has asked how to create a state-of-the-art spelling suggestion system for search engines. I have been working on this subject for more than a year for a search engine company, and I can point to information in the public domain on the subject.

As was mentioned in a previous post, Google (and Microsoft and Yahoo!) do not use any predefined dictionary, nor do they employ hordes of linguists who ponder the possible misspellings of queries. That would be impossible, not only because of the scale of the problem but also because it is not clear that people could actually correctly identify when and if a query is misspelled.

Instead, there is a simple and rather effective principle that is also valid for all European languages. Get all the unique queries from your search logs and calculate the edit distance between all pairs of queries. Within each pair of close queries, assume that the reference (correct) query is the one with the highest count.
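
A small sketch of that principle, under assumptions of my own: queries arrive as a `Counter` of query to frequency, "close" means an edit distance of at most 2, and the toy counts are invented. Comparing all pairs is quadratic, so on a real log you would shortlist candidates first.

```python
from collections import Counter


def edit_distance(a, b):
    """Plain Levenshtein distance, two rows at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def build_suggestions(query_counts, max_distance=2):
    """Map each query to the most frequent, strictly more popular close query."""
    suggestions = {}
    # Most frequent first, so the first close match is also the most popular.
    queries = sorted(query_counts, key=lambda q: (-query_counts[q], q))
    for i, query in enumerate(queries):
        for better in queries[:i]:
            if (query_counts[better] > query_counts[query]
                    and edit_distance(query, better) <= max_distance):
                suggestions[query] = better
                break
    return suggestions


counts = Counter({"barack obama": 900, "weather": 500,
                  "barak obama": 40, "barrack obama": 25})
print(build_suggestions(counts))
```

With these toy counts, both misspellings map to "barack obama", while "weather" and "barack obama" itself get no suggestion.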

This simple algorithm will work great for many types of queries. If you want to take it to the next level then I suggest you read the paper by Microsoft Research on that subject. You can find it here

The paper has a great introduction, but after that you will need to be familiar with concepts such as the Hidden Markov Model.

Costas Boulis
A: 

You could use n-grams for the comparison: http://en.wikipedia.org/wiki/N-gram

Using the Python ngram module: http://packages.python.org/ngram/index.html

import ngram

# Index a few known queries
G2 = ngram.NGram(["iis7 configure ftp 7.5",
                  "ubunto configre 8.5",
                  "mac configure ftp"])

print("Similarity", "String", sep="\t")
# search() returns (item, similarity) pairs
for text, similarity in G2.search("iis7 configurftp 7.5", threshold=0.1):
    print(round(similarity, 2), text, sep="\t")

You get output along these lines:

Similarity  String
0.76    iis7 configure ftp 7.5
0.24    mac configure ftp
0.19    ubunto configre 8.5
hugo24