I want my search results ordered by score, which they are, but the score is not being calculated the way I expect, and I'm not sure why. My goal is to remove whatever is changing the score.

If I perform a search that matches on two objects (where ObjectA is expected to have a higher score than ObjectB), ObjectB is being returned first.

Let's say, for this example, that my query is a single term: "apples".

ObjectA's title: "apples are apples" (2/3 terms)
ObjectA's description: "There were apples in the apples-apples and now the apples went all apples all over the apples!" (6/18 terms)
ObjectB's title: "apples are great" (1/3 terms)
ObjectB's description: "There were apples in the apples-room and now the apples went all bad all over the apples!" (4/18 terms)

The title field has no boost (or rather, a boost of 1) and the description field has a boost of 0.8. I have not specified a document boost through solrconfig.xml or through the query that I'm passing through. If there is another way to specify a document boost, I may be missing it.

After analyzing the explain printout, it looks like ObjectA is properly calculating a higher score than ObjectB, just like I want, except for one difference: ObjectB's title fieldNorm is always higher than ObjectA's.


Here follows the explain printout. Just so you know: the title field is mditem5_tns and the description field is mditem7_tns:

ObjectB:
1.3327172 = (MATCH) sum of:
  1.0352166 = (MATCH) max plus 0.1 times others of:
    0.9766194 = (MATCH) weight(mditem5_tns:appl in 0), product of:
      0.53929156 = queryWeight(mditem5_tns:appl), product of:
        1.8109303 = idf(docFreq=3, maxDocs=9)
        0.2977981 = queryNorm
      1.8109303 = (MATCH) fieldWeight(mditem5_tns:appl in 0), product of:
        1.0 = tf(termFreq(mditem5_tns:appl)=1)
        1.8109303 = idf(docFreq=3, maxDocs=9)
        1.0 = fieldNorm(field=mditem5_tns, doc=0)
    0.58597165 = (MATCH) weight(mditem7_tns:appl^0.8 in 0), product of:
      0.43143326 = queryWeight(mditem7_tns:appl^0.8), product of:
        0.8 = boost
        1.8109303 = idf(docFreq=3, maxDocs=9)
        0.2977981 = queryNorm
      1.3581977 = (MATCH) fieldWeight(mditem7_tns:appl in 0), product of:
        2.0 = tf(termFreq(mditem7_tns:appl)=4)
        1.8109303 = idf(docFreq=3, maxDocs=9)
        0.375 = fieldNorm(field=mditem7_tns, doc=0)
  0.2975006 = (MATCH) FunctionQuery(1000.0/(1.0*float(top(rord(lastmodified)))+1000.0)), product of:
    0.999001 = 1000.0/(1.0*float(1)+1000.0)
    1.0 = boost
    0.2977981 = queryNorm

ObjectA:
1.2324848 = (MATCH) sum of:
  0.93498427 = (MATCH) max plus 0.1 times others of:
    0.8632177 = (MATCH) weight(mditem5_tns:appl in 0), product of:
      0.53929156 = queryWeight(mditem5_tns:appl), product of:
        1.8109303 = idf(docFreq=3, maxDocs=9)
        0.2977981 = queryNorm
      1.6006513 = (MATCH) fieldWeight(mditem5_tns:appl in 0), product of:
        1.4142135 = tf(termFreq(mditem5_tns:appl)=2)
        1.8109303 = idf(docFreq=3, maxDocs=9)
        0.625 = fieldNorm(field=mditem5_tns, doc=0)
    0.7176658 = (MATCH) weight(mditem7_tns:appl^0.8 in 0), product of:
      0.43143326 = queryWeight(mditem7_tns:appl^0.8), product of:
        0.8 = boost
        1.8109303 = idf(docFreq=3, maxDocs=9)
        0.2977981 = queryNorm
      1.6634457 = (MATCH) fieldWeight(mditem7_tns:appl in 0), product of:
        2.4494898 = tf(termFreq(mditem7_tns:appl)=6)
        1.8109303 = idf(docFreq=3, maxDocs=9)
        0.375 = fieldNorm(field=mditem7_tns, doc=0)
  0.2975006 = (MATCH) FunctionQuery(1000.0/(1.0*float(top(rord(lastmodified)))+1000.0)), product of:
    0.999001 = 1000.0/(1.0*float(1)+1000.0)
    1.0 = boost
    0.2977981 = queryNorm
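
As a cross-check of the printout above, here is a minimal Python sketch that rebuilds both scores from the leaf values. It assumes the classic Lucene formulas tf = sqrt(termFreq) and idf = 1 + ln(maxDocs / (docFreq + 1)), which reproduce the tf and idf values shown. The queryNorm, the fieldNorms, the 0.1 tie-breaker and the FunctionQuery value are copied from the printout. The function and variable names are illustrative only.

```python
import math

QUERY_NORM = 0.2977981              # copied from the explain output
IDF = 1.0 + math.log(9 / (3 + 1))   # docFreq=3, maxDocs=9
TIE = 0.1                           # "max plus 0.1 times others"
FUNCTION_QUERY = 0.2975006          # recency term, same for both documents


def field_score(term_freq, field_norm, boost=1.0):
    """Score of one field: queryWeight * fieldWeight."""
    query_weight = boost * IDF * QUERY_NORM
    field_weight = math.sqrt(term_freq) * IDF * field_norm
    return query_weight * field_weight


def total_score(title_tf, title_norm, desc_tf, desc_norm):
    """Combine title and description (boost 0.8), then add the function query."""
    title = field_score(title_tf, title_norm)
    desc = field_score(desc_tf, desc_norm, boost=0.8)
    best, other = max(title, desc), min(title, desc)
    return best + TIE * other + FUNCTION_QUERY


score_b = total_score(1, 1.0, 4, 0.375)
score_a = total_score(2, 0.625, 6, 0.375)
# Hypothetical case: ObjectA's title gets the same fieldNorm as ObjectB's.
score_a_equal_norm = total_score(2, 1.0, 6, 0.375)

print(f"ObjectB:                     {score_b:.4f}")
print(f"ObjectA:                     {score_a:.4f}")
print(f"ObjectA with title norm 1.0: {score_a_equal_norm:.4f}")
```

The first two values agree with the printout to four decimal places. In the hypothetical third case ObjectA comes out near 1.75 and would rank first, so the title fieldNorm alone accounts for the unexpected ordering.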
+1  A: 

fieldNorm is computed from three components: the index-time boost on the field, the index-time boost on the document, and the field length. Assuming that you are not supplying any index-time boost, the difference must be field length.

Thus, since lengthNorm is higher for shorter field values, for B to have a higher fieldNorm value for the title, it must have a smaller number of tokens in the title than A.
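
A minimal sketch of that relationship, assuming the classic Lucene DefaultSimilarity, where lengthNorm is 1/sqrt(number of tokens) and the index-time boosts multiply in. The function name raw_field_norm is hypothetical. The sketch also ignores the single-byte encoding Lucene applies when it stores norms, so the values that explain reports are coarser than these.

```python
import math


def raw_field_norm(num_tokens, field_boost=1.0, doc_boost=1.0):
    """Un-encoded norm: document boost * field boost * lengthNorm."""
    length_norm = 1.0 / math.sqrt(num_tokens)
    return doc_boost * field_boost * length_norm


# With no boosts, fewer tokens means a higher norm.
for n in (3, 4, 5, 18):
    print(n, round(raw_field_norm(n), 4))
```

With both boosts left at 1.0, the only way two documents can differ in this value is by token count.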

See the following pages for a detailed explanation of Lucene scoring:

http://lucene.apache.org/java/2_4_0/scoring.html
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html

KenE
+1 for lots of insight - thanks! Unfortunately, however, you'll notice in my post that I stated what the fields (and their lengths) are. Both objects have titles with 3 tokens and descriptions with 18 tokens. ObjectA's title has 2/3 tokens matching, ObjectB has 1/3 matching, and the matching descriptions are respectively 6/18 and 4/18. So, if I understand what you're saying, the lengthNorm should not be having any effect. May I ask - how would I go about setting index-time boosts?
JMTyler
Sorry, I thought your example was made up and not the actual values. In that case you are right that field length shouldn't be a factor. You can set boosts in Solr in a variety of ways. If you are using SolrJ, I believe there is a "setBoost" method on the SolrInputDocument. But if Doc B were getting a boost, the fieldNorm should be higher in the description field as well. You might also want to check out Luke, which lets you reconstruct your indexed field data so you can see what really gets indexed.
KenE
Nope, not made up - just testing data. :) I'll take a look at the code and see if anything suspicious is happening with index-time boosts. I'll probably also check out Luke. Thanks for the help.
JMTyler
+2  A: 

The problem is caused by the stemmer. It expands "apples are apples" to "apples appl are apples appl", thus making the field longer. Because document B contains only one term that the stemmer expands, its field stays shorter than document A's.

This results in different fieldNorms.
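
To illustrate, here is a toy Python sketch of the expansion described above. It assumes the analyzer keeps each original token and adds the stem after it. The lookup table TOY_STEMS is a made-up stand-in, not the Porter stemmer, and the 1/sqrt(n) figure is the raw lengthNorm before Lucene's byte encoding, so it will not match the explain numbers exactly.

```python
import math

# Hypothetical stem table standing in for a real stemmer.
TOY_STEMS = {"apples": "appl", "apple": "appl"}


def expand(text):
    """Emit each token, followed by its stem when the stem differs."""
    tokens = []
    for word in text.lower().split():
        tokens.append(word)
        stem = TOY_STEMS.get(word, word)
        if stem != word:
            tokens.append(stem)
    return tokens


for title in ("apples are apples", "apples are great"):
    tokens = expand(title)
    raw_norm = 1.0 / math.sqrt(len(tokens))
    print(f"{title!r}: {len(tokens)} tokens, raw lengthNorm {raw_norm:.4f}")
```

Under these assumptions ObjectA's title grows to five tokens and ObjectB's to four, so ObjectB ends up with the higher norm.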

Jem
Could you elaborate, or possibly provide a link? Why would the "stemmer" be expanding my field to something that it *isn't*? That seems counter-intuitive! :)
JMTyler
Unless the first "appl" you wrote was supposed to be "apple"? Having just looked into stemming, that would make sense, if "apples" is being broken down into its root form. So - let me know if I have this right - you're saying that if I change all references to "apple" and search for "apple" only, I should get the results in the order I want?
JMTyler
I edited my post, so it should be clearer now. The stemmer uses "appl" as the root form for "apple" and "apples". So if you disable stemming, you should get the result you expect. You can also exclude terms from being stemmed by adding them to protwords.txt and referencing that file in schema.xml: <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
Jem