ansaurus

Question

Solr: fieldNorm different per document, with no document boost

Answer 1

+1 A:

fieldNorm is computed from three components: the index-time boost on the field, the index-time boost on the document, and the field length. Assuming that you are not supplying any index-time boosts, the difference must come from field length.

Since lengthNorm is higher for shorter field values, for B to have a higher fieldNorm for the title, B's title must contain fewer tokens than A's.
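As a rough illustration of the point above, here is a minimal Python sketch of how the three components combine. It assumes the classic Lucene default, where lengthNorm is 1 / sqrt(number of terms), and boosts of 1.0 when none are supplied. The function names are made up for this example and are not Lucene or Solr APIs. Lucene also stores the result in a single byte, so the values shown by debug output are coarser than what this prints.

```python
import math


def length_norm(num_terms):
    """Classic Lucene default: 1 / sqrt(number of terms in the field)."""
    return 1.0 / math.sqrt(num_terms)


def field_norm(num_terms, doc_boost=1.0, field_boost=1.0):
    """Product of document boost, field boost and lengthNorm (before byte encoding)."""
    return doc_boost * field_boost * length_norm(num_terms)


# With no index-time boosts, only the indexed token count matters.
short_field = field_norm(3)  # a field indexed as 3 tokens
long_field = field_norm(5)   # a field indexed as 5 tokens
print(f"3 tokens: {short_field:.4f}, 5 tokens: {long_field:.4f}")
```

The shorter field gets the larger value, which is why two documents with no boosts can still end up with different fieldNorms.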

See the following pages for a detailed explanation of Lucene scoring:

http://lucene.apache.org/java/2_4_0/scoring.html
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html

KenE 2010-06-23 17:35:13

+1 for lots of insight - thanks! Unfortunately, however, you'll notice in my post that I stated what the fields (and their lengths) are. Both objects have titles with 3 tokens and descriptions with 18 tokens. ObjectA's title has 2/3 tokens matching, ObjectB has 1/3 matching, and the matching descriptions are respectively 6/18 and 4/18. So, if I understand what you're saying, the lengthNorm should not be having any effect. May I ask - how would I go about setting index-time boosts?

JMTyler 2010-06-23 17:47:06

Sorry - I thought your example was made up and not the actual values. In that case you are right that field length shouldn't be a factor. You can set boosts in Solr in a variety of ways. If you are using SolrJ, I believe there is a "setBoost" method on the SolrInputDocument. But if Doc B were getting a document boost, its fieldNorm should be higher in the description field as well. You also might want to check out Luke - it lets you reconstruct your indexed field data so you can see what really gets indexed.

KenE 2010-06-23 18:03:03

Nope, not made up - just testing data. :) I'll take a look at the code and see if anything suspicious is happening with index-time boosts. I'll probably also check out Luke. Thanks for the help.

JMTyler 2010-06-23 18:45:42

Answer 2

+2 A:

The problem is caused by the stemmer. It expands "apples are apples" to "apples appl are apples appl", making the field longer. Because document B contains only one term that the stemmer expands, its field stays shorter than document A's.

This results in different fieldNorms.
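To make the effect concrete, here is a small Python sketch that mimics the expansion described in this answer. It is a toy under stated assumptions, not Solr's analysis chain: the stem table is hard-coded, "apples are fruit" is a hypothetical stand-in for a title with only one stemmable term, and lengthNorm is assumed to be the classic 1 / sqrt(number of terms).

```python
import math

# Hypothetical stem table for illustration; a real Porter stemmer is rule-based.
TOY_STEMS = {"apples": "appl", "apple": "appl"}


def expand_with_stems(text):
    """Emit each token, followed by its stem when the stem differs from the token."""
    tokens = []
    for word in text.lower().split():
        tokens.append(word)
        stem = TOY_STEMS.get(word, word)
        if stem != word:
            tokens.append(stem)
    return tokens


def length_norm(num_terms):
    """Classic Lucene default: 1 / sqrt(number of terms)."""
    return 1.0 / math.sqrt(num_terms)


title_a = expand_with_stems("apples are apples")  # two expanded terms
title_b = expand_with_stems("apples are fruit")   # hypothetical title, one expanded term

for name, tokens in (("A", title_a), ("B", title_b)):
    print(name, tokens, round(length_norm(len(tokens)), 4))
```

Both titles start with three words, but A is indexed as five tokens and B as four, so B ends up with the higher lengthNorm.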

Jem 2010-06-23 19:08:14

Could you elaborate, or possibly provide a link? Why would the "stemmer" be expanding my field to something that it *isn't*? That seems counter-intuitive! :)

JMTyler 2010-06-23 19:49:11

Unless the first "appl" you wrote was supposed to be "apple"? Having just looked into stemming, that would make sense, if "apples" is being broken down into its root form. So - let me know if I have this right - you're saying that if I change all references to "apple" and search for "apple" only, I should get the results in the order I want?

JMTyler 2010-06-23 20:05:39

I edited my post, so it should be clearer now. The stemmer uses "appl" as the root form for "apple" and "apples". So if you disable stemming you should get the result you expect. You can also exclude terms from being stemmed by adding them to protwords.txt and referencing that file in the stemming filter in schema.xml: <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
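The idea behind protected words can be sketched in Python as follows. This is only a model of the behaviour, assuming a protwords.txt with one word per line and "#" comment lines; the stem table and function names are hypothetical, and the expansion follows the description given earlier in this answer rather than Solr's actual filter code.

```python
# Hypothetical stem table for illustration; not the real Porter algorithm.
TOY_STEMS = {"apples": "appl", "apple": "appl"}

# Example file content, assuming one protected word per line and '#' comments.
PROTWORDS_TXT = """\
# words listed here are never stemmed
apples
"""


def load_protected(content):
    """Parse protwords-style content into a set of lowercase words."""
    lines = (line.strip() for line in content.splitlines())
    return {line.lower() for line in lines if line and not line.startswith("#")}


def stem_tokens(text, protected=frozenset()):
    """Emit each token plus its stem, skipping the stem for protected words."""
    tokens = []
    for word in text.lower().split():
        tokens.append(word)
        if word in protected:
            continue
        stem = TOY_STEMS.get(word, word)
        if stem != word:
            tokens.append(stem)
    return tokens


protected = load_protected(PROTWORDS_TXT)
print(stem_tokens("apples are apples"))             # stemmed: 5 tokens
print(stem_tokens("apples are apples", protected))  # protected: 3 tokens
```

With "apples" protected, the title keeps its original three tokens, so the field length no longer differs because of stemming.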

Jem 2010-06-23 20:56:22
