views:

146

answers:

2

I have a set of documents containing scored items that I'd like to index. Our data structure looks like:

Document
  ID
  Text
  List<RelatedScore>

RelatedScore
  ID
  Score

My first thought was to add each RelatedScore as a multi-value field using the Boost property of the Field to modify the value of the particular score when searching.

foreach (var relatedScore in document.RelatedScores) {
  var field = new Field("RelatedScore", relatedScore.ID,
                        Field.Store.YES, Field.Index.UN_TOKENIZED);
  field.SetBoost(relatedScore.Score);
  luceneDoc.Add(field);
}

However, it appears that the "Norm" that is calculated applies to the entire multi-field - all the RelatedScore" values for a document will end up having the same score.

Is there a mechanism in Lucene to allow for this functionality? I would rather not create another index just to account for this - it feels like there should be a way using a single index. If there isn't a means to accomplish this, a few ideas that we have to compensate are :

  1. Insert the multi-value field items in order of descending value. Then somehow add a positional-aware analysis to assign higher boost/score to the first items in the field.
  2. Add a high value score multiple times to the field. So, a RelatedScore with Score==1 might be added three times, while a RelatedScore with Score==.3 would only be added once.

Both of these will result in a loss of search fidelity on these fields, yes, but they may be good enough. Any thoughts on this?

+1  A: 

This appears to be a use case for Payloads. I'm not sure if this is available in Lucene.NET, as I've only used the Java version.

Another hacky way to do this, if the absolute values of the scores aren't that important, is to discretize them (place them in buckets based on value) and create a field for each bucket. So if you have scores that range from 1 to 100, create say, 10 buckets called RelatedScore0_10, RelatedScore10_20, etc, and for any document that has a RelatedScore in that bucket, add a "true" value in that field. Then for every search that gets executed tack on an OR query like:

(RelatedScore0_10:true^1 RelatedScore10_20:true^2 ...)

The nice thing about this is that you can tweak the boost values for each one of your buckets on the fly. Otherwise you'd need to reindex to change the field norm (boost) values for each field.

bajafresh4life
I was able to use the Payload to store the score and use it via a custom similarity object. Once I started looking in this direction, I found Grant Ingersoll recently posted an article on exactly this topic here: http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/
James Kolpack
A: 

If you use Lucene.Net you might not have payloads functionality yet. What you can do is convert 0-100 relevancy score to a bucket from 1-10 (integer division by 10), then add each indexed value that many times (but only store value once). Then if you search for that field, lucene built-in scoring will take into account frequency of indexed field (it will be indexed 1-10 times based on relevance). Therefore results can be sorted by variable relevance.

foreach (var relatedScore in document.RelatedScores) {
  // get bucket for relevance...
  int bucket=relatedScore.Score / 10;

  var field = new Field("RelatedScore", relatedScore.ID,
                        Field.Store.YES, Field.Index.UN_TOKENIZED);
  luceneDoc.Add(field);
  // add more instances of field but only store the first one above...
  for(int i=0;i<bucket;i++)
  {
    luceneDoc.Add(new Field("RelatedScore", relatedScore.ID,
                        Field.Store.NO, Field.Index.UN_TOKENIZED));
  }
}