I have a problem with Lucene's scoring function that I can't figure out. So far, I've been able to write this code to reproduce it.

package lucenebug;

import java.util.Arrays;
import java.util.List;

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class Test {
    private static final String TMP_LUCENEBUG_INDEX = "/tmp/lucenebug_index";

    public static void main(String[] args) throws Throwable {
        SimpleAnalyzer analyzer = new SimpleAnalyzer();
        IndexWriter w = new IndexWriter(TMP_LUCENEBUG_INDEX, analyzer, true);
        List<String> names = Arrays.asList(new String[] {
                "the rolling stones",
                "rolling stones (karaoke)",
                "the rolling stones tribute",
                "rolling stones tribute band",
                "karaoke - the rolling stones" });
        try {
            for (String name : names) {
                System.out.println("#name: " + name);
                Document doc = new Document();
                doc.add(new Field("name", name, Field.Store.YES,
                        Field.Index.TOKENIZED));
                w.addDocument(doc);
            }
            System.out.println("finished adding docs, total size: "
                    + w.docCount());
        } finally {
            w.close();
        }

        IndexSearcher s = new IndexSearcher(TMP_LUCENEBUG_INDEX);
        QueryParser p = new QueryParser("name", analyzer);
        Query q = p.parse("name:(rolling stones)");
        System.out.println("--------\nquery: " + q);

        TopDocs topdocs = s.search(q, null, 10);
        for (ScoreDoc sd : topdocs.scoreDocs) {
            System.out.println("" + sd.score + "\t"
                    + s.doc(sd.doc).getField("name").stringValue());
        }
    }
}

The output I get from running it is:

finished adding docs, total size: 5
--------
query: name:rolling name:stones
0.578186    the rolling stones
0.578186    rolling stones (karaoke)
0.578186    the rolling stones tribute
0.578186    rolling stones tribute band
0.578186    karaoke - the rolling stones

I just can't understand why "the rolling stones" gets the same relevance as "the rolling stones tribute". According to Lucene's documentation, the more tokens a field has, the smaller its normalization factor should be, so "the rolling stones tribute" should score lower than "the rolling stones".

Any ideas?

A: 

I can reproduce it on Lucene 2.3.1 but do not know why this happens.

joseph
+4  A: 

The length normalization factor is calculated as 1/sqrt(numTerms) (you can see this in DefaultSimilarity).

This result is not stored in the index directly. It is multiplied by the boost value for the field, if one is specified, and the final result is then encoded in 8 bits, as explained in Similarity.encodeNorm(). This is a lossy encoding, which means fine details get lost.
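You can watch this lossiness happen outside Lucene. The sketch below re-implements the 8-bit scheme (3 mantissa bits, zero exponent 15, modeled on Lucene 2.x's SmallFloat.floatToByte315()/byte315ToFloat(); the class and method names here are my own re-implementation for illustration, not the library classes). It shows that the norms for a 3-token field (1/sqrt(3) ≈ 0.577) and a 4-token field (1/sqrt(4) = 0.5) collapse onto the same byte, while multiplying both by a boost of 100 spreads them far enough apart to survive the encoding:

```java
// Illustration of Lucene's lossy 8-bit norm encoding: a re-implementation
// of the SmallFloat.floatToByte315 / byte315ToFloat scheme (3 mantissa
// bits, zero exponent 15), not the Lucene classes themselves.
public class NormEncodingDemo {

    // Compress a float into 8 bits, keeping only 3 mantissa bits.
    static byte floatToByte315(float f) {
        int fzero = (63 - 15) << 3;           // smallest non-zero code point
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3);    // drop all but 3 mantissa bits
        if (smallfloat <= fzero) {
            return (bits <= 0) ? (byte) 0 : (byte) 1;  // underflow
        }
        if (smallfloat >= fzero + 0x100) {
            return -1;                                 // overflow
        }
        return (byte) (smallfloat - fzero);
    }

    // Expand the 8-bit form back into a float.
    static float byte315ToFloat(byte b) {
        if (b == 0) return 0.0f;
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        float norm3 = (float) (1.0 / Math.sqrt(3));  // 3-token field: 0.577...
        float norm4 = (float) (1.0 / Math.sqrt(4));  // 4-token field: 0.5

        // Both norms collapse onto the same byte (120), so after decoding
        // the 3- and 4-token documents have identical length normalization:
        System.out.println(floatToByte315(norm3) + " -> "
                + byte315ToFloat(floatToByte315(norm3)));   // 120 -> 0.5
        System.out.println(floatToByte315(norm4) + " -> "
                + byte315ToFloat(floatToByte315(norm4)));   // 120 -> 0.5

        // A field boost of 100 spreads the values apart, so they land on
        // different bytes and the ranking difference survives:
        System.out.println(byte315ToFloat(floatToByte315(100f * norm3))); // 56.0
        System.out.println(byte315ToFloat(floatToByte315(100f * norm4))); // 48.0
    }
}
```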

If you want to see length normalization in action, try creating a document with the following sentence:

the rolling stones tribute a b c d e f g h i j k

This will create a large enough difference in the length normalization values for it to show up in the scores.

Now, if your fields have very few tokens, as in the examples you have used, you could set boost values for the documents/fields based on your own formula (essentially a higher boost for shorter fields). Alternatively, you could create a custom Similarity and override the lengthNorm() method.

Shashikant Kore
That's right: the 8-bit encoding was rounding the `boost*lengthNorm` values together and causing the problem. Setting the field boost to 100 during indexing is a clean enough workaround for me. @Shashikant Kore Thanks!
martin