views: 10799
answers: 18

I've been developing an internal website for a portfolio management tool. There is a lot of text data: company names, etc. I've been really impressed with some search engines' ability to respond to queries very quickly with "Did you mean: xxxx".

I need to be able to take a user query intelligently and respond not only with raw search results but also with a "Did you mean?" suggestion when there is a highly likely alternative answer.

[I'm developing in ASP.NET (VB - don't hold it against me! )]

UPDATE: OK, how can I mimic this without the millions of 'unpaid users'?

  • Generate typos for each 'known' or 'correct' term and perform lookups?
  • Some other more elegant method?
+28  A: 

I found this article some time ago: http://www.norvig.com/spell-correct.html.

It's an interesting read about the "spelling correction" topic. The examples are in Python, but they're clear and simple to understand, and I think the algorithm can easily be translated to other languages.
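The core of the article is generating every string one edit away from the input and intersecting that set with known words. A minimal Python sketch of that candidate-generation step (Norvig's full version also ranks candidates by word frequency, which is omitted here):

```python
# Sketch of the edit-distance-1 candidate generation from Norvig's article.
import string

def edits1(word):
    """Return all strings at edit distance 1 from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word, known_words):
    """Return `word` if known, otherwise a known word one edit away."""
    if word in known_words:
        return word
    candidates = edits1(word) & known_words
    # Norvig picks the most frequent candidate; for simplicity we just
    # pick one deterministically here.
    return min(candidates) if candidates else word
```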

Davide Gualano
I didn't realize Peter Norvig was at Google. Norvig is kind of a giant in AI. He wrote AI: A Modern Approach.
BobbyShaftoe
@Davide: """the examples are in python but it's clear and simple to understand""": I don't understand your use of "but" ... I'd say given Python + Norvig's writing style, "clear and simple to understand" is the expected outcome.
John Machin
The "but" was there because Harry said in his question that he is a VB.NET developer, so I assumed he wasn't confident with the Python language.
Davide Gualano
+2  A: 

As a guess... it could 1) search for the words, and 2) if they are not found, use some algorithm to "guess" the intended word. That could be something from AI like a Hopfield network or back-propagation network, or something else that "identifies fingerprints", restores broken data, or does spelling corrections as Davide mentioned already...

badbadboy
+7  A: 

Hmm... I thought that google used their vast corpus of data (the internet) to do some serious NLP (Natural Language Processing).

For example, they have so much data from the entire internet that they can count the number of times a three-word sequence occurs (known as a trigram). So if they see a sentence like: "pink frugr concert", they could see it has few hits, then find the most likely "pink * concert" in their corpus.

They apparently just do a variation of what Davide Gualano was saying, though, so definitely read that link. Google does of course use all web-pages it knows as a corpus, so that makes its algorithm particularly effective.
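The trigram idea above can be sketched roughly like this (a toy in-memory counter over a token list; how Google actually stores and queries its corpus is of course far more sophisticated):

```python
# Toy sketch of the trigram idea: count 3-word sequences in a corpus,
# then fill the blank in "pink * concert" with the most frequent middle word.
from collections import Counter

def trigram_counts(corpus_tokens):
    """Count every 3-word sequence in a token list."""
    return Counter(zip(corpus_tokens, corpus_tokens[1:], corpus_tokens[2:]))

def best_middle(counts, left, right):
    """Most frequent word seen between `left` and `right`, or None."""
    candidates = {mid: n for (a, mid, b), n in counts.items()
                  if a == left and b == right}
    return max(candidates, key=candidates.get) if candidates else None
```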

Claudiu
+2  A: 

I saw something on this a few years back, so it may have changed since, but apparently they started by analysing their logs for the same users submitting very similar queries within a short space of time, and used machine learning based on how users had corrected themselves.

seanb
+2  A: 

Simple. They have tons of data. They have statistics for every possible term, based on how often it is queried, and what variations of it usually yield results the users click... so, when they see you typed a frequent misspelling for a search term, they go ahead and propose the more usual answer.

Actually, if the misspelling is in effect the most frequently searched term, the algorithm will take it for the right one.

schonarth
Nobody has doubted that Google has all the necessary data to do this, but the question was asking for details on how Google has come up with an algorithm to do this, with so much data, in a reasonable amount of time. They would have gazillions of searches a day - how do they easily identify whether a search term is a 'spelling correction' of another, recent one? What factors make Google decide that one term is a misspelling of another? These are implementation details that would be of interest.
thomasrutter
+5  A: 

My guess is that they use a combination of a Levenshtein distance algorithm and the masses of data they collect regarding the searches that are run. They could pull a set of searches that have the shortest Levenshtein distance from the entered search string, then pick the one with the most results.
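A rough sketch of that combination (the `past_queries` list stands in for the mass of logged searches this answer assumes):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def closest_queries(query, past_queries, k=3):
    """Return the k logged queries nearest to `query` by edit distance."""
    return sorted(past_queries, key=lambda q: levenshtein(query, q))[:k]
```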

Jim Burger
Let's say that you have a total of billions of web pages' worth of words stored. There is no easy way to index Levenshtein distance for fast retrieval of near matches without calculating the Levenshtein distance some billions of times for every word queried. Levenshtein distance is therefore not of much use in this situation, at least not in the first stage, where Google needs to narrow down from billions of existing words to just those words which are likely to be misspellings of the current word. It can definitely apply Levenshtein as a later step once it has already fetched likely matches.
thomasrutter
+75  A: 

Here's the explanation directly from the source ( almost )

Search 101!

at min 22:03

Worth watching!

Basically, according to Douglas Merrill, former CTO of Google, it is like this:

1) You write a (misspelled) word in Google.

2) You don't find what you wanted (you don't click on any results).

3) You realize you misspelled the word, so you rewrite it in the search box.

4) You find what you want (you click on the first links).

This pattern, multiplied millions of times, shows what the most common misspellings are and what the most "common" corrections are.

This way Google can offer spelling correction almost instantaneously, in every language.

Also, this means that if overnight everyone started to spell night as "nigth", Google would suggest that word instead.

EDIT

@ThomasRutter: Douglas describes it as "statistical machine learning".

They know who corrects the query, because they know which query comes from which user (using cookies).

If users perform a query, and only 10% of them click on a result while 90% go back and type another query (with the corrected word), and this time that 90% click on a result, then they know they have found a correction.
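That click-through signal could be mined from logs along these lines (a toy sketch; the session format, the consecutive-query pairing, and the `min_support` threshold are all invented for illustration):

```python
# Toy sketch: find (misspelling -> correction) pairs from query sessions,
# where the first query got no click and the follow-up query did.
from collections import Counter

def mine_corrections(sessions, min_support=2):
    """sessions: one list of (query, was_clicked) events per user session."""
    pairs = Counter()
    for events in sessions:
        for (q1, clicked1), (q2, clicked2) in zip(events, events[1:]):
            if not clicked1 and clicked2 and q1 != q2:
                pairs[(q1, q2)] += 1
    # Keep only pairs seen often enough across users.
    return {bad: good for (bad, good), n in pairs.items() if n >= min_support}
```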

They can also tell whether two different queries are "related", because they have information about all the links they show.

Furthermore, they are now including context in the spell check, so they can even suggest different words depending on the context.

See this demo of google wave ( @ 44m 06s ) that shows how the context is taken into account to automatically correct the spelling.

Here it is explained how that natural language processing works.

And finally here is an awesome demo of what can be done adding automatic machine translation ( @ 1h 12m 47s ) to the mix.

I've added minute-and-second anchors to the videos to skip directly to the content; if they don't work, try reloading the page or scrolling to the mark by hand.

OscarRyz
watching the video now, pretty entertaining
Harry
How does the algorithm work though? How does Google go from "We receive billions of searches with various terms, and these are those searches" to "this term must therefore be a common misspelling of this term"? They have solved this problem, but I am interested in how. How do they figure that two searches are from the same user, and which word is a 'correction' of another, and how do they aggregate this over billions of searches?
thomasrutter
If everyone started misspelling "night" ... I believe they already ran into this with people searching for "Flickr."
Max Lybbert
What do factorials have to do with the question?
micmoo
the problem with everyone misspelling something has already happened in a much more severe sense: Try typing 'fuscia' into Google. Google says "Did you mean fuschia?" The correct spelling, in fact, is "fuchsia," but no one can spell it correctly for some reason. The problem is even worse on Dictionary.com; if you type "fuschia" into their search, it gives you "No results for fuschia. Did you mean 'fuschia'?" (i.e., did you mean what you just typed?)
David Hollman
+1 "statistical machine learning"
Lazer
+14  A: 

For the theory of the "did you mean" algorithm you can refer to Chapter 3 of Introduction to Information Retrieval. It is available online for free. Section 3.3 (page 52) exactly answers your question. And to specifically answer your update, you only need a dictionary of words and nothing else (no millions of users required).

Szere Dyeri
Thank you for the link.
Jonas Kongslund
Interesting reading to know the details on the implementation +1
OscarRyz
+2  A: 

Hi, regarding your question of how to mimic the behavior without having tons of data - why not use the tons of data collected by Google? Download the Google search results for the misspelled word and search for "Did you mean:" in the HTML.

I guess that's called mashup nowadays :-)

Tomas Petricek
how long until google stops your bot from scraping? - or wouldn't google even notice these days?
Harry
I don't think they'll notice if the reqs/sec aren't too high.
Mauricio Scheffer
A: 

Easiest way to figure it out is to Google dynamic programming.

It's an algorithm that's been borrowed from Information Retrieval and is used heavily in modern-day bioinformatics to see how similar two gene sequences are.

Optimal solution uses dynamic programming and recursion.

This is a very solved problem with lots of solutions. Just google around until you find some open source code.

ewakened
A: 

It should be obvious that they aren't simply doing spelling correction due to the various funny results that have popped up from time to time with real words. Things like:

"French military victories" -- no hits, did you mean "French military defeats"? That's humans using related words, not spelling.

Likewise "What to do if the grill gets wet" suggesting "What to do if the girl gets wet".

Loren Pechtel
The "French military defeats" thing is just a joke. The "... victories" search brings ~300K results. The first one is the joke page with "did you mean 'French military defeats'?" If you press "I'm feeling lucky" you go straight to the joke page. How they got to be the top result is another story.
P Daddy
It was not a joke, it really worked that way. It's caused a lot of comments since then that now get hits.
Loren Pechtel
Even if there were never a French military victory, a google search for French military victories would get hits about Nazi Germany having military victories, including defeating the French.
Andrew Grimm
+3  A: 

Google apparently suggests queries with the best results, not those which are spelled correctly. But in this case, a spell-corrector would probably be more feasible. Of course you could store some value for every query, based on some metric of how good the results it returns are.

So,

  1. You need a dictionary (english or based on your data)

  2. Generate a word trellis and calculate probabilities for the transitions using your dictionary.

  3. Add a decoder to calculate the minimum error distance using your trellis. Of course you should take care of insertions and deletions when calculating distances. A fun detail is that a QWERTY keyboard makes adjacent-key hits likely errors, so such substitutions can be treated as closer ("cae" would turn into "car", "cay" would turn into "cat").

  4. Return the word which has the minimum distance.

  5. Then you could compare that to your query database and check if there are better results for other close matches.
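The decoder step with keyboard-aware costs might look roughly like this (a simplified weighted edit distance rather than a full trellis decoder; the 0.5 adjacency discount is an arbitrary choice for illustration):

```python
# Sketch: Levenshtein distance where substituting adjacent QWERTY keys
# costs less than substituting distant ones.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
POS = {c: (r, i) for r, row in enumerate(ROWS) for i, c in enumerate(row)}

def sub_cost(a, b):
    """Cheap substitution for neighbouring keys, full cost otherwise."""
    if a == b:
        return 0.0
    # Characters not on the letter rows fall back to a dummy position.
    (r1, c1), (r2, c2) = POS.get(a, (9, 9)), POS.get(b, (9, 9))
    return 0.5 if abs(r1 - r2) <= 1 and abs(c1 - c2) <= 1 else 1.0

def keyboard_distance(a, b):
    """Edit distance with keyboard-aware substitution costs."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        curr = [float(i)]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + sub_cost(ca, cb)))
        prev = curr
    return prev[-1]
```

With this metric, "cae" is closer to "car" (e and r are adjacent keys) than to "cat".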

Geee
A: 

This is quite an interesting post!

I'm writing a search engine, but irrespective of the algorithms used, there's a high probability that commonly misspelled words will always be misspelled by every generation, as long as the language stays the same.

It's interesting how few people have thought about the fact that some spelling mistakes are due to keyboard errors - Google is also succeeding in covering this using the concept of the most common misspellings and the most "common" corrections.

The future of search engines is flashing green!

Chris Musasizi

+3  A: 

Normally a production spelling corrector utilizes several methodologies to provide a spelling suggestion. Some are:

  • Decide on a way to determine whether spelling correction is required. These may include insufficient results, results which are not specific or accurate enough (according to some measure), etc. Then:

  • Use a large body of text, or a dictionary, where all or most entries are known to be correctly spelled. These are easily found online, in places such as LingPipe. Then, to determine the best suggestion, you look for the word which is the closest match based on several measures. The most intuitive one is similar characters. What has been shown through research and experimentation is that two- or three-character sequence matches work better (bigrams and trigrams). To further improve results, give a higher score to matches at the beginning or end of the word. For performance reasons, index all these words as trigrams or bigrams, so that when you perform a lookup, you convert the input to n-grams and look it up via a hashtable or trie.

  • Use heuristics related to potential keyboard mistakes based on character location. So that "hwllo" should be "hello" because 'w' is close to 'e'.

  • Use a phonetic key (Soundex, Metaphone) to index the words and lookup possible corrections. In practice this normally returns worse results than using n-gram indexing, as described above.

  • In each case you must select the best correction from a list. This may use a distance metric such as Levenshtein, the keyboard metric, etc.

  • For a multi-word phrase, only one word may be misspelled, in which case you can use the remaining words as context in determining a best match.
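The n-gram indexing idea above can be sketched as follows (a toy in-memory index; a production system would also weight boundary matches higher, as the answer notes):

```python
# Toy sketch: index dictionary words by their character trigrams,
# then rank suggestions by trigram overlap with the query.
from collections import defaultdict

def char_ngrams(word, n=3):
    """Character n-grams with boundary markers, e.g. 'hello' -> $he, hel, ..."""
    padded = "$" + word + "$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def build_index(dictionary, n=3):
    """Map each character n-gram to the dictionary words containing it."""
    index = defaultdict(set)
    for word in dictionary:
        for g in char_ngrams(word, n):
            index[g].add(word)
    return index

def suggest(query, index, n=3):
    """Rank dictionary words by how many n-grams they share with the query."""
    overlap = defaultdict(int)
    for g in char_ngrams(query, n):
        for word in index.get(g, ()):
            overlap[word] += 1
    return sorted(overlap, key=overlap.get, reverse=True)
```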

eulerfx
A: 

There is a specific data structure - ternary search tree - that naturally supports partial matches and near-neighbor matches.

A: 

Try an image search for a person I used to know, "Robert Wiander". This ends up with "Did you mean 'Robert Winder'?" and I don't get ANY of the pictures of the person I was looking for.

Go even further, search for: "robert wiander" -winder and you get "Did you mean 'robert wander' -wonder?"

How does one turn this bastard of a feature OFF?!?

Aki
Google doesn’t hide search results from you, even if it thinks you misspelled the query. If you didn’t find anything that means that Google didn’t, either. Nothing to do with this feature.
Konrad Rudolph
@Konrad: "Google doesn’t hide search results from you" - unless you're in _insert_country_name_, or a DMCA has been filed, or ... ( see also http://en.wikipedia.org/wiki/Censorship_by_Google )
Piskvor
[@Aki] ouuuuu you didnt just call Google (or their features) a bastard ? You will now be destoryed !!
user279521
+2  A: 

Use Levenshtein distance, then create a Metric Tree (or Slim-tree) to index words. Then run a 1-nearest-neighbour query, and you have the result.
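A minimal sketch of that approach using a BK-tree, one common kind of metric tree over edit distance (the traversal here is a simple stack walk with triangle-inequality pruning, not an optimized implementation):

```python
def levenshtein(a, b):
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

class BKTree:
    """Metric tree: each child sits at a fixed distance from its parent,
    letting lookups prune whole subtrees via the triangle inequality."""

    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})  # (word, {distance: child_node})
        for w in it:
            self._add(w)

    def _add(self, word):
        node = self.root
        while True:
            w, children = node
            d = levenshtein(word, w)
            if d == 0:              # already stored
                return
            if d not in children:
                children[d] = (word, {})
                return
            node = children[d]

    def nearest(self, query):
        """1-nearest-neighbour: the stored word closest to `query`."""
        best = (self.root[0], levenshtein(query, self.root[0]))
        stack = [self.root]
        while stack:
            w, children = stack.pop()
            d = levenshtein(query, w)
            if d < best[1]:
                best = (w, d)
            for dist, child in children.items():
                if abs(dist - d) <= best[1]:  # triangle-inequality prune
                    stack.append(child)
        return best
```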

Nicolas Dorier
A: 

Here is the VB.NET version of the Norvig Spelling Corrector. You may find this useful if it is not too late!

Ralph Wiggum