tags:

views:

295

answers:

7

When searching the db with terms that retrieve no results I want to allow "did you mean..." suggestion (like Google). So for example if someone looks for "jquyer" ", it would output "did you mean jquery?"

Of course, suggestion results have to be matched against the values inside the db (i'm using mysql).

Do you know a library that can do this? I've googled this but haven't found any great results. Or perhaps you have an idea how to construct this on my own?

+2  A: 

How about the levenshtein function, or similar_text function?

mgroves
Not sure how well that would work, you'd have to individually apply it to every word in the search string against every word in a dictionary.
Rob
The Levenshtein function, and other string similarity measures are not practical if you have more than a few thousand different words. Essentially one needs to compare every single word in the database table with the user-supplied keyword... These expensive string similarity functions can possibly be applied as the _final stage_ of a similar word searching algorithm, provided that some filtering is possible at the level of the database, to limit the number of comparaison effectively needed.
mjv
+1  A: 

You should keep track of common misspellings that come through your search (or generate some yourself with a typo generator) and store the misspelling and the word it matches in a database. Then, when you have nothing matching any search results, you can check against the misspelling table, and use the suggested word.

GSto
Excellent link, cheers!
Daniel May
thanks, though I was looking for something a bit more practical and sustainable.
sombe
+4  A: 

A quick and easy solution involves SOUNDEX or SOUNDEX-like functions.

In a nutshell the SOUNDEX function was originally used to deal with common typos and alternate spellings for family names, and this function, encapsulates very well many common spelling mistakes (in the english language). Because of its focus on family names, the original soundex function may be limiting (for example encoding stops after the third or fourth non-repeating consonant letter), but it is easy to expend the algorithm.

The interest of this type of function is that it allows computing, ahead of time, a single value which can be associated with the word. This is unlike string distance functions such as edit distance functions (such as Levenshtein, Hamming or even Ratcliff/Obershelp) which provide a value relative to a pair of strings.

By pre-computing and indexing the SOUNDEX value for all words in the dictionary, one can, at run-time, quickly search the dictionary/database based on the [run-time] calculated SOUNDEX value of the user-supplied search terms. This Soundex search can be done systematically, as complement to the plain keyword search, or only performed when the keyword search didn't yield a satisfactory number of records, hence providing the hint that maybe the user-supplied keyword(s) is (are) misspelled.


A totally different approach, only applicable on user queries which include several words, is based on running multiple queries against the dictionary/database, excluding one (or several) of the user-supplied keywords. These alternate queries' result lists provide a list of distinct words; This [reduced] list of words is typically small enough that pair-based distance functions can be applied to select, within the list, the words which are closer to the allegedly misspelled word(s). The word frequency (within the results lists) can be used to both limit the number of words (only evaluate similarity for the words which are found more than x times), as well as to provide weight, to slightly skew the similarity measurements (i.e favoring words found "in quantity" in the database, even if their similarity measurement is slightly less).

mjv
I noticed there was a mysql builtin function called "sounds like" which takes advantage of soundex. so "select * from table where column sounds like $string;" would work, i'm just not sure it would go as smoothly and quickly.
sombe
@Gal, The sounds like function likely performs `SOUNDEX(left) = SOUNDEX(right)`. If you use an index you will have to compare the `SOUNDEX` of the `$string` manually in the SQL statement.
strager
@Gal I don't know of `sound_like` in MySQL, but if it is as defined by strager, you do NOT want to use that. Instead you need to use MySQL SOUNDEX() function to populate a new field in your dictionary (or in your keyword-to-Document table, if you do not have an dictionary). This new field, call it SoundexVal, can then be used at search time, to either find similar words (in dictionary) or directly find records that approximately match the user's search criteria. In either case you use a query like `...WHERE SOUNDEX(user-supplied-word) = SoundexVal`. In this fashion SQL only computes...
mjv
... one soundex value, and since the SoudexVal is [typically] index, the query is resolved very quickly.
mjv
Another consideration is to your your own function rather than MySQL's SOUNDEX() to avoid some of the limitations associated with this pure SOUNDEX algorithm (limit on word length etc..) Such a home-made function can be implemented PHP-side, and you then compute the value in the application and pass it to SQL which uses it as-is rather than having a SOUNDEX call in the query itself.
mjv
+2  A: 

Actually, I believe Google's "did you mean" function is generated by what users type in after they've made a typo. However, that's obviously a lot easier for them since they have unbelievable amounts of data.

You could use Levenshtein distance as mgroves suggested (or Soundex), but store results in a database. Or, run separate scripts based on common misspellings and your most popular misspelled search terms.

DisgruntledGoat
+1 That’s a clever solution.
Gumbo
As far as I know, Google's suggestions are based on how common words are in their index. If you search for 'murmoset', the engine will say 'hmmm, there are very few hits for that, but tons for "marmoset"', and will suggest that instead. Of course, that doesn't explain how they decide that the two words are similar.
Nathan Long
A: 

http://www.phpclasses.org/browse/package/4859.html

Here's an off-the-shelf class that's rather easy to implement, which employs minimum edit distance. All you need to do is have a token (not type) list of all the words you want to work with handy. My suggestion is to make sure it's the complete list of words within your search index, and only within your search index. This helps in two ways:

  • Domain specificity helps avoid misleading probabilities from overtaking your implementation
    • Ex: "Memoize" may be spell-corrected to "Memorize" for most off-the-shelf, dictionaries, but that's a perfectly good search term for a computer science page.
  • Proper nouns that are available within your search index are now accounted for.
    • Ex: If you're Dell, and someone searches for 'inspiran', there's absolutely no chance the spell-correct function will know you mean 'inspiron'. It will probably spell-correct to 'inspiring' or something more common, and, again, less domain-specific.
Robert Elwell
A: 

Writing your own custom solution will take quite some time and is not guaranteed to work if your dataset isn't big enough, so I'd recommend using an API from a search giant such as Yahoo. Yahoo's results aren't as good as Google's but I'm not sure whether Google's is meant to be public.

Josh Davis
A: 

When I did this a couple of years ago, I already had a custom built index of words that the search engine used. I studied what kinds of errors people made the most (based on logs) and sorted the suggestions based on how common the mistake was.

If someone searched for jqeury, I would build a select-statement that went
SELECT Word,1 AS Relevance FROM keywords WHERE Word IN ('qjuery','juqery','jqeury' etc)
UNION SELECT Word, 2 AS Relevant FROM keywords WHERE Word LIKE 'j_query' OR Word LIKE 'jq_uery' etc
etc ORDER BY Relevance,Word

The resulting words were my suggestions and it worked really well.

Oliver John