My Lucene index contains documents with the field "itemName". This field is boosted with a boost factor between 0 and 1. When I create a BooleanQuery, I'd like the results to be ranked by the number of matched clauses and the boost factor, so the formula looks like:

score = (count_of_matching_clauses / count_of_total_clauses + boost_factor) / 2

The score would always be a float between 0 and 1, reaching 1 only when all clauses match and the boost factor is 1.

For example, if the "itemName" values of three documents with no boost factor are:

document1: "java is an island"
document2: "the secret of monkey island"
document3: "java island adventures"

and the BooleanQuery would look like:

TermQuery query1 = new TermQuery(new Term("itemName","java"));
TermQuery query2 = new TermQuery(new Term("itemName","island"));

BooleanQuery query = new BooleanQuery();
query.add(query1, BooleanClause.Occur.SHOULD);
query.add(query2, BooleanClause.Occur.SHOULD);

then document1 would be retrieved with a score of (2/2 + 0)/2 = 0.5, because count_of_matching_clauses = 2 and count_of_total_clauses = 2,

document2 would be retrieved with a score of (1/2 + 0)/2 = 0.25, because count_of_matching_clauses = 1 and count_of_total_clauses = 2,

and document3 would be retrieved with a score of (2/2 + 0)/2 = 0.5, because count_of_matching_clauses = 2 and count_of_total_clauses = 2.

How can this ranking mechanism be implemented in Lucene? How can I tell Lucene to use my custom ranking class for ranking the results?
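To make the arithmetic concrete, here is a minimal plain-Java sketch of the proposed formula applied to the three example documents. It does not use Lucene at all: the class name `ClauseCountScore`, the helper `countMatches`, and the whitespace tokenization are assumptions made only for this illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class ClauseCountScore {
    // Proposed formula: (matching / total + boost) / 2.
    static float score(int matchingClauses, int totalClauses, float boost) {
        return ((float) matchingClauses / totalClauses + boost) / 2f;
    }

    // Counts how many query terms occur among the whitespace-separated,
    // lower-cased tokens of a field value (a stand-in for real analysis).
    static int countMatches(String fieldValue, String... queryTerms) {
        Set<String> tokens =
                new HashSet<>(Arrays.asList(fieldValue.toLowerCase().split("\\s+")));
        int matches = 0;
        for (String term : queryTerms) {
            if (tokens.contains(term)) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        String[] docs = {
            "java is an island",
            "the secret of monkey island",
            "java island adventures"
        };
        String[] query = {"java", "island"};
        for (String doc : docs) {
            int matches = countMatches(doc, query);
            // Boost is 0 here, as in the example above.
            System.out.println(doc + " -> " + score(matches, query.length, 0f));
        }
    }
}
```

With a boost of 0 the three documents come out at 0.5, 0.25 and 0.5, matching the hand calculation.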

+1  A: 

You can implement your own scoring algorithm by extending the Similarity class and passing it during search. In the Javadoc of this class (follow the link), you can read the details of the scoring algorithm. Some more text on scoring can be found here. An exceptional aid to understanding scoring is to look at the explanation of a score as returned by Searcher.explain().

By the way, the scoring you wish to implement is the default scoring. The order of results will be as desired, though the actual scores may differ from 0.5 or 0.25.

Shashikant Kore
No, this is not the default scoring. It is related to the coord() factor, but tf() and idf() may change not only the scores but also the ordering. The rest of your answer is fine - especially the explain() part.
Yuval F
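The point about tf() and idf() changing the ordering can be illustrated with a small plain-Java sketch. It mirrors the classic DefaultSimilarity formulas (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), coord = overlap / maxOverlap) and combines them as coord * sum(tf * idf^2). Query normalization, boosts and field norms are left out as a simplifying assumption, and the index statistics (1000 documents, "java" in 10 of them, "island" in 900) are made up for the example.

```java
class CoordVersusTfIdf {
    // Formulas mirrored from the classic TF-IDF DefaultSimilarity.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    static double idf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    static double coord(int overlap, int maxOverlap) {
        return (double) overlap / maxOverlap;
    }

    // Simplified score: coord * sum over matching terms of tf * idf^2.
    // freqs[i] is the frequency of query term i in the document (0 = no match).
    static double score(int[] freqs, int[] docFreqs, int numDocs) {
        double sum = 0.0;
        int overlap = 0;
        for (int i = 0; i < freqs.length; i++) {
            if (freqs[i] == 0) {
                continue; // clause did not match
            }
            overlap++;
            double w = idf(docFreqs[i], numDocs);
            sum += tf(freqs[i]) * w * w;
        }
        return coord(overlap, freqs.length) * sum;
    }

    public static void main(String[] args) {
        int numDocs = 1000;            // hypothetical index size
        int[] docFreqs = {10, 900};    // hypothetical: "java" is rare, "island" is common

        double bothOnce = score(new int[] {1, 1}, docFreqs, numDocs);
        double javaOnly = score(new int[] {9, 0}, docFreqs, numDocs);

        System.out.println("matches both clauses once: " + bothOnce);
        System.out.println("matches only 'java', 9x:   " + javaOnly);
        System.out.println("single-clause doc ranks first: " + (javaOnly > bothOnce));
    }
}
```

Under these assumed statistics the document matching a single rare term nine times outranks the document matching both clauses once, even though its coord factor is only 0.5.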
Well, you are right: there is a chance that a document matching only one query clause scores higher, due to a high tf-idf score, than another document matching both clauses. But, anecdotally, I have seen that the more query clauses match, the higher the score with DefaultSimilarity.
Shashikant Kore
Does changing the Similarity class change the scoring as a whole, or just the components of the standard ranking formula?
tommyL
The basic components of scoring remain the same. You can pick and choose which ones you want to change.
Shashikant Kore
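As a rough picture of "pick and choose", the following plain-Java sketch keeps the combining formula fixed and lets a subclass override individual components, which is roughly how a Similarity subclass behaves. The classes `DefaultHooks` and `CoordOnlyHooks` are hypothetical stand-ins and not the Lucene API; real Similarity method signatures differ between Lucene versions, and query normalization and field norms are omitted here.

```java
class SimilaritySketch {
    // Hypothetical stand-in for the overridable scoring components.
    static class DefaultHooks {
        double tf(int freq) {
            return Math.sqrt(freq);
        }

        double idf(int docFreq, int numDocs) {
            return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
        }

        double coord(int overlap, int maxOverlap) {
            return (double) overlap / maxOverlap;
        }

        // The way components are combined stays fixed; only the hooks vary.
        final double score(int[] freqs, int[] docFreqs, int numDocs) {
            double sum = 0.0;
            int overlap = 0;
            for (int i = 0; i < freqs.length; i++) {
                if (freqs[i] == 0) {
                    continue; // clause did not match
                }
                overlap++;
                double w = idf(docFreqs[i], numDocs);
                sum += tf(freqs[i]) * w * w;
            }
            return coord(overlap, freqs.length) * sum;
        }
    }

    // Flattens tf and idf so that only the number of matched clauses matters.
    static class CoordOnlyHooks extends DefaultHooks {
        @Override
        double tf(int freq) {
            return freq > 0 ? 1.0 : 0.0;
        }

        @Override
        double idf(int docFreq, int numDocs) {
            return 1.0;
        }
    }

    public static void main(String[] args) {
        int[] docFreqs = {10, 900};   // hypothetical term statistics
        int[] bothOnce = {1, 1};      // matches both clauses once
        int[] javaOnly = {9, 0};      // matches one clause nine times

        DefaultHooks coordOnly = new CoordOnlyHooks();
        System.out.println("both clauses: " + coordOnly.score(bothOnce, docFreqs, 1000));
        System.out.println("one clause:   " + coordOnly.score(javaOnly, docFreqs, 1000));
    }
}
```

With tf and idf flattened, the ordering in this sketch depends only on how many clauses matched. In a real index the returned values would still be affected by query normalization and field norms, so Searcher.explain() remains the way to check what actually contributes to a score.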