I am referring to the algorithm used to give query suggestions when a user types a search term into Google.

I am mainly interested in how Google's algorithm is able to:

1. Show the most important results (the most likely queries rather than anything that matches)
2. Match substrings
3. Do fuzzy matching

I know you could use a trie or a generalized trie to find matches, but it wouldn't meet the above requirements...

Similar questions asked earlier here

Thanks

+1  A: 

There are tools like Soundex and Levenshtein distance that can be used to find fuzzy matches within a certain range.

Soundex finds words that sound similar, and Levenshtein distance finds words that are within a certain edit distance of another word.
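
To make the edit distance idea concrete, here is a minimal sketch in Python. It is the textbook dynamic programming version of Levenshtein distance, not anything Google is known to use, and the helper `fuzzy_matches` and its `max_distance` parameter are names made up for this illustration.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def fuzzy_matches(term, candidates, max_distance=2):
    """Hypothetical helper: candidates within max_distance edits of term,
    closest first."""
    scored = [(levenshtein(term, c), c) for c in candidates]
    return [c for d, c in sorted(scored) if d <= max_distance]


print(levenshtein("kitten", "sitting"))
print(fuzzy_matches("tomcat", ["tomcat", "tomcar", "combat", "python"], max_distance=1))
```

Comparing the typed term against every candidate like this is only practical for small candidate lists; a real suggestion service would need some way to narrow the candidates first.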

Ólafur Waage
A: 

I think that one might be better off constructing a specialized trie, rather than pursuing a completely different data structure.

I could see that functionality manifested in a trie in which each leaf had a field that reflected the frequency of searches of its corresponding word.

The search query method would display the descendant leaf nodes with the largest values, where each value is calculated by multiplying the distance to that descendant leaf node by the search frequency associated with it.
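
A rough sketch of such a trie in Python might look like the following. The class and method names (`SuggestTrie`, `add`, `suggest`) are made up for this example, and it makes two simplifying assumptions: a frequency can be stored on any node where a query ends (not only on leaves), and suggestions are ranked by frequency alone rather than by the distance-times-frequency score described above.

```python
import heapq


class TrieNode:
    def __init__(self):
        self.children = {}  # character -> TrieNode
        self.frequency = 0  # > 0 only where a stored query ends


class SuggestTrie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, query, frequency=1):
        """Store a query, or add to its search count if already stored."""
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
        node.frequency += frequency

    def suggest(self, prefix, k=3):
        """Return up to k stored queries starting with prefix,
        most frequently searched first."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        # Walk the subtree below the prefix node and collect stored queries.
        found = []
        stack = [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.frequency > 0:
                found.append((current.frequency, text))
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        # Highest frequency first; ties broken alphabetically.
        best = heapq.nsmallest(k, found, key=lambda item: (-item[0], item[1]))
        return [text for _, text in best]


trie = SuggestTrie()
trie.add("tomcat tutorial", 50)
trie.add("tomcat download", 80)
trie.add("tomcat tuning", 20)
trie.add("tom hanks", 95)
print(trie.suggest("tomcat t"))
print(trie.suggest("tom", k=2))
```

Swapping in a different score only means changing the sort key in `suggest`, which is where extra per-node fields would come into play.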

The data structure (and consequently the algorithm) Google uses is probably vastly more complicated, potentially taking into account a large number of other factors, such as search frequencies from your own specific account (and time of day... and weather... season... and lunar phase... and... ). However, I believe that the basic trie data structure can be extended to support any kind of specialized search preference by adding fields to each of the nodes and using those fields in the search query method.

T.K.
+1  A: 

Take a look at Firefox's Awesome Bar algorithm.

Google Suggest is useful because it takes millions of popular queries plus your past related queries into account.
It doesn't have a good completion algorithm / UI though:
1) It doesn't do substrings.
2) It seems to be a relatively simple word-boundary prefix algorithm.
For example: try "tomcat tut" --> it correctly suggests "tomcat tutorial". Now try "tomcat rial" --> no suggestions )-:
3) It doesn't support "did you mean?" as in Google search results.
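
The difference between points 1 and 2 can be sketched in a few lines of Python. This is only a guess at the behaviour observed above, not Google's actual algorithm, and both function names are made up for the illustration.

```python
def word_prefix_match(typed, candidate):
    """True if every typed word is a prefix of some word in the candidate
    (a simple word-boundary prefix rule)."""
    candidate_words = candidate.lower().split()
    return all(
        any(word.startswith(part) for word in candidate_words)
        for part in typed.lower().split()
    )


def substring_match(typed, candidate):
    """True if every typed word occurs anywhere inside the candidate."""
    text = candidate.lower()
    return all(part in text for part in typed.lower().split())


print(word_prefix_match("tomcat tut", "tomcat tutorial"))   # True
print(word_prefix_match("tomcat rial", "tomcat tutorial"))  # False
print(substring_match("tomcat rial", "tomcat tutorial"))    # True
```

With the word-boundary rule, "rial" finds nothing because no word in "tomcat tutorial" starts with it; a substring rule would accept it.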

Dekel