Hi.
I need to exclude duplicates from my database. The problem is that the duplicates are not exact matches but rather similar documents. For this purpose I decided to use FuzzyQuery as follows:

var fuzzyQuery = new global::Lucene.Net.Search.FuzzyQuery(
                     new Term("text", queryText),
                     0.8f,   // minimum similarity
                     0);     // prefix length
 // Search with the fuzzy query that was just built.
 hits = _searcher.Search(fuzzyQuery);

The idea was to set the minimum similarity to 0.8 (which I think is high enough) so that only sufficiently similar documents are found.
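
To give a sense of what a 0.8 threshold means: a fuzzy query compares individual indexed terms, not whole documents. Below is a minimal sketch, assuming the classic Lucene formula of 1 - editDistance / (length of the shorter term) with a prefix length of 0. `FuzzySketch`, `EditDistance` and `Similarity` are hypothetical names used for illustration only; they are not part of Lucene.Net.

```csharp
using System;

static class FuzzySketch
{
    // Levenshtein edit distance, computed with two rolling rows.
    public static int EditDistance(string a, string b)
    {
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            // Swap rows so that prev always holds the last completed row.
            var tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.Length];
    }

    // Assumed formula: 1 - distance / length of the shorter string.
    // The result can be negative for very different strings.
    public static float Similarity(string a, string b)
    {
        int shorter = Math.Min(a.Length, b.Length);
        if (shorter == 0) return a.Length == b.Length ? 1f : 0f;
        return 1f - (float)EditDistance(a, b) / shorter;
    }
}
```

Under this assumption, `FuzzySketch.Similarity("document", "documents")` is 0.875, so that pair of terms would pass a 0.8 threshold. A whole document passed as a single term would be compared against each indexed word separately.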

To test this code, I checked whether it finds a document that already exists: I assigned queryText a value that is stored in the index. The code above found nothing; in other words, it does not detect even an exact match.

The index was built with this code:

 doc.Add(new global::Lucene.Net.Documents.Field(
            "text",
            text,
            global::Lucene.Net.Documents.Field.Store.YES,
            global::Lucene.Net.Documents.Field.Index.TOKENIZED,
            global::Lucene.Net.Documents.Field.TermVector.WITH_POSITIONS_OFFSETS));

I followed the recommendations below, and the results are as follows. TermQuery doesn't return any results. A query constructed with

 var _analyzer = new RussianAnalyzer();
 var parser = new global::Lucene.Net.QueryParsers
                .QueryParser("text", _analyzer);
 var query = parser.Parse(queryText);
 var _searcher = new IndexSearcher
       (Settings.General.Default.LuceneIndexDirectoryPath);
 var hits = _searcher.Search(query);

returns several results: the document that matches exactly has the highest score, followed by several other documents with similar content.

+2  A: 

It might help to look inside the index: that will clearly show what data you're querying against and how Lucene 'sees' your data. You can use Luke for this. It has some known compatibility issues with Lucene.NET, but it is much better than nothing.

AlexS
+1  A: 

I second the recommendation for Luke. A few other things to try:

  1. Try first an exact query, say a TermQuery for the term "text". If this doesn't work, no fuzzy query will.
  2. Use Explain() to see how the scoring went (that is provided you get other hits).
  3. Follow the suggestions from Debugging Relevance Issues in Search.
Yuval F
What if TermQuery doesn't return any results, but an ordinary search still does? Could it be connected with the fact that the search is done in Russian?
Jenea
I suspect that using different analyzers for indexing and retrieval is to blame. Try using RussianAnalyzer for both - use the RussianAnalyzer for the FuzzyQuery search as well.
Yuval F
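
One way to picture the analyzer issue: a TOKENIZED field is stored as separate analyzed terms, while TermQuery and FuzzyQuery look up the given string as a single term without analyzing it. Below is a minimal sketch, assuming a simplified stand-in analyzer that only splits on non-letters and lowercases (the real RussianAnalyzer does more, such as stemming). `TermLookupSketch`, `Analyze` and `TermMatches` are hypothetical names, not Lucene.Net APIs.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class TermLookupSketch
{
    // Stand-in for an analyzer: split on non-letter characters and lowercase.
    public static HashSet<string> Analyze(string text)
    {
        var terms = new HashSet<string>();
        var current = new StringBuilder();
        foreach (char c in text)
        {
            if (char.IsLetter(c))
            {
                current.Append(char.ToLowerInvariant(c));
            }
            else if (current.Length > 0)
            {
                terms.Add(current.ToString());
                current.Length = 0; // start the next token
            }
        }
        if (current.Length > 0) terms.Add(current.ToString());
        return terms;
    }

    // A term-level lookup succeeds only if the exact string is an indexed term.
    public static bool TermMatches(HashSet<string> indexedTerms, string term)
    {
        return indexedTerms.Contains(term);
    }
}
```

With this stand-in, indexing "Быстрая коричневая лиса" yields the terms "быстрая", "коричневая" and "лиса". Looking up the whole original string as one term finds nothing, while looking up "лиса" succeeds, which is consistent with the parsed query working and the raw TermQuery not.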
A: 

Try the MoreLikeThis class in Lucene...it has some great heuristics encoded that would help you identify "similar" documents.

Mikos
Yes, but I also need to know the degree of similarity in absolute numbers.
Jenea
1. Use MLT to retrieve similar docs using Lucene.
2. Use a cosine similarity algorithm to measure similarity. Simmetrics is a great F/OSS library (it has a .NET implementation as well).
Mikos
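
For reference, cosine similarity gives an absolute number between 0 and 1 for two texts represented as term-frequency vectors. Below is a minimal from-scratch sketch, assuming simple whitespace tokenization and lowercasing. It does not use the Simmetrics API; `CosineSketch`, `TermFrequencies` and `Cosine` are hypothetical names for illustration.

```csharp
using System;
using System.Collections.Generic;

static class CosineSketch
{
    // Count how often each lowercased, whitespace-separated token occurs.
    public static Dictionary<string, int> TermFrequencies(string text)
    {
        var freq = new Dictionary<string, int>();
        var separators = new[] { ' ', '\t', '\r', '\n' };
        foreach (var token in text.ToLowerInvariant()
                                  .Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            int count;
            freq.TryGetValue(token, out count); // count stays 0 if the token is new
            freq[token] = count + 1;
        }
        return freq;
    }

    // Cosine of the angle between the two term-frequency vectors.
    public static double Cosine(string a, string b)
    {
        var fa = TermFrequencies(a);
        var fb = TermFrequencies(b);
        double dot = 0, normA = 0, normB = 0;
        foreach (var kv in fa)
        {
            normA += kv.Value * kv.Value;
            int other;
            if (fb.TryGetValue(kv.Key, out other)) dot += kv.Value * other;
        }
        foreach (var kv in fb) normB += kv.Value * kv.Value;
        if (normA == 0 || normB == 0) return 0.0; // an empty text has no direction
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```

Texts with no words in common score 0 and texts with identical word counts score 1 (up to floating-point rounding), so a duplicate threshold could be applied to this value for the candidates that MLT returns.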