Hi,

Can someone please tell me how to implement a "Did you mean" feature in Lucene.Net?

Thanks!

+1  A: 

AFAIK Lucene supports fuzzy search, meaning that a query like:

field:stirng~0.5

(that is a tilde sign)

will match "string". The float is how "tolerant" the search is: values close to 1.0 require a near-exact match, and values close to 0.0 match almost anything (sort of). Don't confuse this with proximity search, which uses the same tilde on a quoted phrase, as in "foo bar"~10.

Different parsers will, however, implement this differently.

A fuzzy search is much slower than a prefix search (stri*), so use it with caution. In your case, one would assume that if you find no matches on a regular search, you try a fuzzy search to see what you find, and present "did you mean" based on the result somehow.

It might be useful to cache this sort of lookup for very common misspellings, for performance reasons.
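
To make the tolerance float and the fallback idea a bit more concrete, here is a minimal sketch in plain Java that does not touch Lucene at all. The class FuzzyFallback and its methods are hypothetical names, the term list stands in for a real index, and the score 1 - distance / (length of the shorter word) is only an assumption about how such a similarity could be computed; check your Lucene version before relying on it.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of Lucene or Lucene.Net. */
final class FuzzyFallback {
    private final List<String> indexedTerms;                    // stand-in for the index's term list
    private final double minSimilarity;                         // plays the role of the float after the tilde
    private final Map<String, String> cache = new HashMap<>();  // remembers earlier fuzzy lookups

    FuzzyFallback(List<String> indexedTerms, double minSimilarity) {
        this.indexedTerms = indexedTerms;
        this.minSimilarity = minSimilarity;
    }

    /** Plain Levenshtein distance: insert, delete and substitute each cost 1. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;          // reuse the two rows
        }
        return prev[b.length()];
    }

    /** Assumed score: 1 - distance / length of the shorter word (1.0 means identical). */
    static double similarity(String a, String b) {
        int shorter = Math.min(a.length(), b.length());
        if (shorter == 0) return a.equals(b) ? 1.0 : 0.0;
        return 1.0 - (double) editDistance(a, b) / shorter;
    }

    /** Returns a suggestion, or null if the term is indexed or nothing is similar enough. */
    String didYouMean(String term) {
        if (indexedTerms.contains(term)) return null;           // the regular search would already hit
        if (cache.containsKey(term)) return cache.get(term);    // common misspelling seen before
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String candidate : indexedTerms) {                 // the slow part: scans every term
            double score = similarity(term, candidate);
            if (score >= minSimilarity && score > bestScore) {
                best = candidate;
                bestScore = score;
            }
        }
        cache.put(term, best);
        return best;
    }
}
```

Under that assumed formula, "stirng" and "string" are two edits apart over six letters, so they score about 0.67 and pass a 0.5 threshold. The loop over every term is also why this kind of search is slow, and why the cache is worth having.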

jishi
A: 

Thanks for your answer! Could you please also tell me if Lucene has a built-in spell checker? If not, are there any free ones available?

I don't believe so. Lucene isn't aware of the language itself; however, there are language-dependent analyzers that are aware of common "glue" words (stop words) for different languages (is, and, or, a, etc. for English).
jishi
A: 

I used Lucene.Net over last summer, and I noticed that on a stock query, if I searched for "Nor**th**shore" or "Nor**ht**shore", the results were about the same (the misspelling occurred in the data once or twice), so it was my impression that it did this sort of thing automatically to some degree.

pbh101
A: 

Google's "Did you mean?" is (probably; they're secretive, of course) implemented by consulting their query log. Look to see if people who searched for the query you're processing searched for something very similar soon after; if so, it indicates they made a mistake, and realized what they ought to be searching for.

Since you probably don't have a huge query log, you could approximate it. Take the query, split up the terms, see if there are any similar terms in the database (by edit distance, whatever); replace your terms with those nearby terms, and rerun the query. If you get more hits, that was probably a better query. Suggest it to the user. (And since you've already got the hits, and most people only look at the top 2 results, show them those.)
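
A rough sketch of that approximation, in plain Java with no Lucene involved. Everything here is an assumption made for illustration: the QueryRewriter class is hypothetical, the map of term to document ids stands in for the real index, queries are treated as a simple AND of lowercase a-z terms, and "similar" is narrowed to "one edit away" by generating candidates rather than computing a distance against every term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; the postings map stands in for a real index. */
final class QueryRewriter {
    private final Map<String, Set<Integer>> postings;   // term -> ids of documents containing it

    QueryRewriter(Map<String, Set<Integer>> postings) {
        this.postings = postings;
    }

    /** Number of documents that contain every term (a simple AND query). */
    int countHits(List<String> terms) {
        Set<Integer> hits = null;
        for (String term : terms) {
            Set<Integer> docs = postings.getOrDefault(term, Collections.emptySet());
            if (hits == null) hits = new HashSet<>(docs);
            else hits.retainAll(docs);
        }
        return hits == null ? 0 : hits.size();
    }

    /** All strings one delete, transpose, replace or insert away from word (a-z only). */
    static Set<String> oneEditAway(String word) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i <= word.length(); i++) {
            String head = word.substring(0, i);
            String tail = word.substring(i);
            if (!tail.isEmpty()) out.add(head + tail.substring(1));             // delete
            if (tail.length() > 1) {
                out.add(head + tail.charAt(1) + tail.charAt(0) + tail.substring(2)); // transpose
            }
            for (char c = 'a'; c <= 'z'; c++) {
                if (!tail.isEmpty()) out.add(head + c + tail.substring(1));     // replace
                out.add(head + c + tail);                                       // insert
            }
        }
        return out;
    }

    /** The term itself if indexed, else the one-edit neighbour found in the most documents. */
    String nearestTerm(String term) {
        if (postings.containsKey(term)) return term;
        String best = term;
        int bestDocs = 0;
        for (String candidate : oneEditAway(term)) {
            Set<Integer> docs = postings.get(candidate);
            if (docs != null && docs.size() > bestDocs) {
                best = candidate;
                bestDocs = docs.size();
            }
        }
        return best;
    }

    /** Returns the rewritten query if it gets more hits than the original, else null. */
    String suggest(String query) {
        List<String> original = Arrays.asList(query.toLowerCase(Locale.ROOT).trim().split("\\s+"));
        List<String> rewritten = new ArrayList<>();
        for (String term : original) rewritten.add(nearestTerm(term));
        return countHits(rewritten) > countHits(original) ? String.join(" ", rewritten) : null;
    }
}
```

With an index where "string" and "search" share two documents, suggest("stirng search") would return "string search", because the rewritten query gets two hits and the original gets none. A correctly spelled query returns null, so nothing is suggested.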

Jay Kominek
There's a simple explanation of what "Did you mean" does here: http://norvig.com/spell-correct.html. It's a very interesting read.
Matt Warren
+11  A: 

You should look into the SpellChecker module in the contrib directory. It's a port of Java Lucene's SpellChecker module, so the Java documentation should be helpful.

(From the javadocs:)

Example Usage:

  SpellChecker spellchecker = new SpellChecker(spellIndexDirectory);
  // To index a field of a user index:
  spellchecker.indexDictionary(new LuceneDictionary(my_lucene_reader, a_field));
  // To index a file containing words:
  spellchecker.indexDictionary(new PlainTextDictionary(new File("myfile.txt")));
  String[] suggestions = spellchecker.suggestSimilar("misspelt", 5);
itsadok
this is the right answer, should be accepted! just what i was looking for ;)
Andrew Bullock
The SpellChecker module moved: https://svn.apache.org/repos/asf/lucene/lucene.net/trunk/C%23/contrib/SpellChecker.Net/
Domenic
A: 

See this article on java.net; it solves pretty much what you are after.

flalar
A: 

Take a look at the Google Code project called semanticvectors. There's a decent amount of discussion on the Lucene mailing lists about using it for functionality like what you're after; however, it is written in Java.

You will probably have to parse and use some machine learning algorithms on your search logs to build a feature like this!

Max