I have a problem with Lucene's scoring function that I can't figure out. So far, I've been able to write this code to reproduce it.

package lucenebug;

import java.util.Arrays;
import java.util.List;

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class Test {
    private static final String TMP_LUCENEBUG_INDEX = "/tmp/lucenebug_index";

    public static void main(String[] args) throws Throwable {
        SimpleAnalyzer analyzer = new SimpleAnalyzer();
        IndexWriter w = new IndexWriter(TMP_LUCENEBUG_INDEX, analyzer, true);
        List<String> names = Arrays.asList(new String[] {
                "the rolling stones",
                "rolling stones (karaoke)",
                "the rolling stones tribute",
                "rolling stones tribute band",
                "karaoke - the rolling stones" });
        try {
            for (String name : names) {
                System.out.println("#name: " + name);
                Document doc = new Document();
                doc.add(new Field("name", name, Field.Store.YES,
                        Field.Index.TOKENIZED));
                w.addDocument(doc);
            }
            System.out.println("finished adding docs, total size: "
                    + w.docCount());
        } finally {
            w.close();
        }

        IndexSearcher s = new IndexSearcher(TMP_LUCENEBUG_INDEX);
        QueryParser p = new QueryParser("name", analyzer);
        Query q = p.parse("name:(rolling stones)");
        System.out.println("--------\nquery: " + q);

        TopDocs topdocs = s.search(q, null, 10);
        for (ScoreDoc sd : topdocs.scoreDocs) {
            System.out.println("" + sd.score + "\t"
                    + s.doc(sd.doc).getField("name").stringValue());
        }
    }
}

The output I get from running it is:

finished adding docs, total size: 5
--------
query: name:rolling name:stones
0.578186    the rolling stones
0.578186    rolling stones (karaoke)
0.578186    the rolling stones tribute
0.578186    rolling stones tribute band
0.578186    karaoke - the rolling stones

I just can't understand why "the rolling stones" gets the same relevance as "the rolling stones tribute". According to Lucene's documentation, the more tokens a field has, the smaller its normalization factor should be, so "the rolling stones tribute" should score lower than "the rolling stones".

Any ideas?

A: 

I can reproduce it on Lucene 2.3.1 but do not know why this happens.

joseph
+4  A: 

The length normalization factor is calculated as 1/sqrt(numTerms) (you can see this in DefaultSimilarity).

This result is not stored in the index directly. It is multiplied by the boost value for the field, if one is specified, and the final result is then encoded in 8 bits, as explained in Similarity.encodeNorm(). This is a lossy encoding, which means fine details get lost.
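You can watch this lossiness happen outside Lucene. The sketch below re-implements the 8-bit scheme (3 mantissa bits, zero exponent 15, modeled on Lucene 2.x's SmallFloat.floatToByte315()/byte315ToFloat(); the class and method names here are my own re-implementation for illustration, not the library classes). It shows that the norms for a 3-token field (1/sqrt(3) ≈ 0.577) and a 4-token field (1/sqrt(4) = 0.5) collapse onto the same byte, while multiplying both by a boost of 100 spreads them far enough apart to survive the encoding:

```java
// Illustration of Lucene's lossy 8-bit norm encoding: a re-implementation
// of the SmallFloat.floatToByte315 / byte315ToFloat scheme (3 mantissa
// bits, zero exponent 15), not the Lucene classes themselves.
public class NormEncodingDemo {

    // Compress a float into 8 bits, keeping only 3 mantissa bits.
    static byte floatToByte315(float f) {
        int fzero = (63 - 15) << 3;           // smallest non-zero code point
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3);    // drop all but 3 mantissa bits
        if (smallfloat <= fzero) {
            return (bits <= 0) ? (byte) 0 : (byte) 1;  // underflow
        }
        if (smallfloat >= fzero + 0x100) {
            return -1;                                 // overflow
        }
        return (byte) (smallfloat - fzero);
    }

    // Expand the 8-bit form back into a float.
    static float byte315ToFloat(byte b) {
        if (b == 0) return 0.0f;
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        float norm3 = (float) (1.0 / Math.sqrt(3));  // 3-token field: 0.577...
        float norm4 = (float) (1.0 / Math.sqrt(4));  // 4-token field: 0.5

        // Both norms collapse onto the same byte (120), so after decoding
        // the 3- and 4-token documents have identical length normalization:
        System.out.println(floatToByte315(norm3) + " -> "
                + byte315ToFloat(floatToByte315(norm3)));   // 120 -> 0.5
        System.out.println(floatToByte315(norm4) + " -> "
                + byte315ToFloat(floatToByte315(norm4)));   // 120 -> 0.5

        // A field boost of 100 spreads the values apart, so they land on
        // different bytes and the ranking difference survives:
        System.out.println(byte315ToFloat(floatToByte315(100f * norm3))); // 56.0
        System.out.println(byte315ToFloat(floatToByte315(100f * norm4))); // 48.0
    }
}
```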

If you want to see length normalization in action, try creating a document with the following sentence:

the rolling stones tribute a b c d e f g h i j k

This will create a large enough difference in the length normalization values for it to show up in the scores.

Now, if your fields have very few tokens, as in the examples you have used, you could set boost values for the documents/fields based on your own formula (essentially a higher boost for shorter fields). Alternatively, you could create a custom Similarity and override the lengthNorm() method.

Shashikant Kore
That's right: the 8-bit encoding was rounding the `boost*lengthNorm` values together and causing the problem. Setting the field boost to 100 during indexing is a clean enough workaround for me. @Shashikant Kore Thanks!
martin