ansaurus

Question

How do I detect whether a similar document is already stored in a Lucene index?

Answer 1

+2 A:

It might help to look inside the index. That will clearly show what data you're querying against and how Lucene 'sees' your data. You can use Luke for this. It has some known compatibility issues with Lucene.NET, but it is much better than nothing.

AlexS 2010-02-09 20:26:40

Answer 2

+1 A:

I second the recommendation for Luke. A few other things to try:

First try an exact query, say a TermQuery for the term "text". If this doesn't work, no fuzzy query will.
Use Explain() to see how the scoring went (provided that you get other hits).
Follow the suggestions from Debugging Relevance Issues in Search.

Yuval F 2010-02-10 08:21:24

What if TermQuery doesn't return any results, but an ordinary search still does? Could it be connected with the fact that the search is done in Russian?

Jenea 2010-02-10 11:00:52

I suspect that using different analyzers for indexing and retrieval is to blame. Try using RussianAnalyzer for both, including for the FuzzyQuery search.
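To make the analyzer point concrete, here is a minimal toy sketch in plain Python. It is not Lucene code: `toy_analyze`, `build_index`, `raw_term_lookup` and `analyzed_lookup` are hypothetical names invented for illustration, and the "analyzer" only lowercases and strips a trailing "s". It shows how a raw term that skips analysis can miss a document that an analyzed query finds, because the index only contains analyzed tokens.

```python
from collections import defaultdict


def toy_analyze(text):
    """Hypothetical analyzer: lowercase, split on whitespace, strip a trailing 's'."""
    tokens = []
    for raw in text.lower().split():
        # Crude stand-in for stemming; a real analyzer does far more.
        token = raw[:-1] if raw.endswith("s") and len(raw) > 3 else raw
        tokens.append(token)
    return tokens


def build_index(docs, analyze):
    """Build a toy inverted index: analyzed token -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index


def raw_term_lookup(index, term):
    """Look the term up exactly as given, with no analysis applied."""
    return set(index.get(term, set()))


def analyzed_lookup(index, text, analyze):
    """Run the query text through the same analyzer used at indexing time."""
    hits = set()
    for token in analyze(text):
        hits |= index.get(token, set())
    return hits


docs = {1: "Similar Documents in the index", 2: "unrelated text"}
index = build_index(docs, toy_analyze)

# The index holds "document", so the unanalyzed term "Documents" finds nothing,
# while the analyzed query matches document 1.
print(raw_term_lookup(index, "Documents"))
print(analyzed_lookup(index, "Documents", toy_analyze))
```

If the same thing is happening in the real index, the term passed to the exact query would need to match the analyzed form stored in the index, which is something Luke can show.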

Yuval F 2010-02-10 14:34:24

Answer 3

A:

Try the MoreLikeThis class in Lucene...it has some great heuristics encoded that would help you identify "similar" documents.
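As a rough idea of the kind of heuristic involved, the sketch below picks the highest-scoring terms of a document by a simple tf-idf weight, which could then be used as a query for similar documents. This is an assumption-laden toy in plain Python, not the MoreLikeThis implementation: `tokenize` and `interesting_terms` are hypothetical names, the smoothed idf formula is one common choice, and the real class has additional settings that are not modelled here.

```python
import math
from collections import Counter


def tokenize(text):
    """Hypothetical tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def interesting_terms(doc, corpus, max_terms=3):
    """Return the max_terms terms of doc with the highest tf-idf weight."""
    tf = Counter(tokenize(doc))
    n_docs = len(corpus)
    corpus_tokens = [set(tokenize(other)) for other in corpus]
    scored = []
    for term, freq in tf.items():
        # Document frequency: how many corpus documents contain the term.
        df = sum(1 for tokens in corpus_tokens if term in tokens)
        # Smoothed idf (an assumption; other variants exist).
        idf = math.log((n_docs + 1) / (df + 1)) + 1.0
        scored.append((freq * idf, term))
    # Highest weight first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:max_terms]]


corpus = [
    "the lucene index stores the documents",
    "the fuzzy query uses the index",
    "the analyzer splits the text",
]
doc = "lucene index lucene the"
print(interesting_terms(doc, corpus, max_terms=2))
```

Terms that are frequent in the document but rare in the corpus rank first, so common words such as "the" fall to the bottom of the list.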

Mikos 2010-04-02 02:01:41

Yes, but I also need to know the degree of similarity in absolute numbers.

Jenea 2010-04-02 08:06:48

1. Use MLT to retrieve similar docs using Lucene.
2. Use a cosine similarity algorithm to measure similarity. SimMetrics is a great F/OSS library (it has a .NET implementation as well).
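For the second step, a minimal sketch of cosine similarity over raw term counts might look like the following. It is plain Python and does not use the SimMetrics API; `term_vector` and `cosine_similarity` are hypothetical helper names, and whitespace tokenization is an assumption. With non-negative term counts the score falls between 0 (no shared terms) and 1 (same term distribution), which gives the absolute number asked for.

```python
import math
from collections import Counter


def term_vector(text):
    """Hypothetical helper: term -> count, using lowercase whitespace tokens."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


candidate = term_vector("how to detect a similar document in lucene")
stored = term_vector("detect a similar document stored in a lucene index")
print(cosine_similarity(candidate, stored))
```

Each document returned by the first step could be compared with the new document this way, and a threshold chosen for what counts as "already stored".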

Mikos 2010-04-02 13:07:03
