Hi All.

Suppose I've got multiple Lucene indexes (not replicas) on several PCs.

I query each index and then merge the results. Is there any way to normalize the document scores so that I could sort by score (relevance)?

As I understand it, the score of document A from index A is not comparable with the score of document B from index B unless I apply some sort of normalization. Is that right?

Thanks, Roey

+2  A: 

First, study the Lucene Similarity Documentation. Out of all the factors there, the only one that is different from one index to another is the inverse document frequency (idf).

I suggest you use Luke or a debugger to see the impact of the different indexes' idfs. You may find that this only has a minor influence.

Here is a discussion about using a global idf, and here is a wiki page about distributed search design in Solr. I believe the problem is not yet solved.

Lucene scoring does not lend itself to simple normalization. I suggest you distribute the documents across the indexes as randomly as possible, so that the per-index term statistics end up similar, and then compare how your hits from the two indexes rank.
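To get a feel for how much the idf can differ between indexes, here is a minimal sketch in plain Python (no Lucene calls). It assumes the classic TF-IDF similarity formula, idf = 1 + ln(numDocs / (docFreq + 1)); the per-index statistics and the helper names `classic_idf` and `global_idf` are made up for illustration. A "global" idf here simply means summing the document frequencies and document counts over all indexes before applying the formula.

```python
import math


def classic_idf(doc_freq, num_docs):
    """idf as in Lucene's classic TF-IDF similarity (assumed formula)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def global_idf(index_stats):
    """Compute one idf from statistics aggregated over all indexes.

    index_stats is a list of (doc_freq, num_docs) pairs, one per index.
    """
    total_doc_freq = sum(doc_freq for doc_freq, _ in index_stats)
    total_num_docs = sum(num_docs for _, num_docs in index_stats)
    return classic_idf(total_doc_freq, total_num_docs)


# Hypothetical statistics for one term: rare in the first index,
# ten times more common in the second.
stats = [(9, 1000), (99, 1000)]

local = [classic_idf(doc_freq, num_docs) for doc_freq, num_docs in stats]
shared = global_idf(stats)

print("local idfs:", local)
print("global idf:", shared)
```

With a skewed distribution like this one, the same term is weighted quite differently in each index. If the documents are spread randomly, the local values should sit close to the global one, which is why the difference often turns out to be minor.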

Yuval F
+1 to randomly distribute the documents. You have to make sure this is indeed a problem. In most cases, the different DF values between the indexes won't really hurt you.
bajafresh4life
A: 

For comparing the score of document A across indices X and Y, I compute x = score(A, X) / (max score of any document that is a hit for the search on index X) and y = score(A, Y) / (max score of any document that is a hit for the search on index Y).

Both x and y are now between 0 and 1. Just add x and y to get the final score.
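A rough sketch of the max-score normalization in plain Python, adapted to the merge scenario from the question (each document lives in exactly one index, so the normalized scores are sorted rather than added). The hit lists, document ids and function names are hypothetical, and scores are assumed to be non-negative.

```python
def normalize_hits(hits):
    """Divide every score by the best score in the same result list.

    hits is a list of (doc_id, raw_score) pairs from a single index.
    """
    if not hits:
        return []
    top = max(score for _, score in hits)
    if top <= 0:
        # Avoid dividing by zero when nothing scored above 0.
        return [(doc_id, 0.0) for doc_id, _ in hits]
    return [(doc_id, score / top) for doc_id, score in hits]


def merge_normalized(*per_index_hits):
    """Normalize each index's hits separately, then sort them together."""
    merged = []
    for hits in per_index_hits:
        merged.extend(normalize_hits(hits))
    return sorted(merged, key=lambda hit: hit[1], reverse=True)


# Made-up raw scores from two indexes on very different scales.
index_x_hits = [("A", 4.0), ("C", 2.0)]
index_y_hits = [("B", 0.9), ("D", 0.6)]

merged = merge_normalized(index_x_hits, index_y_hits)
for doc_id, score in merged:
    print(doc_id, round(score, 3))
```

Note that the top hit of every index ends up with a score of exactly 1, no matter how well it actually matched the query, so a weak best match in one index ranks level with a strong best match in another.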

This is a naive approach. I would like to hear your comments on it.

But I don't understand why you want to add the scores of two different documents. What is the use case?

iamrohitbanga