Hi, I have the following 3 documents in a Lucene index.

  1. As MBA you will play an integral role in implementing the strategy of the business and will have the responsibilities of the statutory accounts, compliance, audit including banking relationships, tax, treasury & cash management

  2. As M.B.A. you will play an integral role in implementing the strategy of the business and will have the responsibilities of the statutory accounts, compliance, audit including banking relationships, tax, treasury & cash management

  3. As Master of Business Administration you will play an integral role in implementing the strategy of the business and will have the responsibilities of the statutory accounts, compliance, audit including banking relationships, tax, treasury & cash management

My search input is "MBA", and the query I execute on Lucene is:

+((description:mba^3.0) (description:m.b.a.) (description:\"master business administration\"))

I get the results in the following order after sorting by score, descending:

Document # 3
Document # 2
Document # 1

Shouldn't Document # 1 come at the top of the search results, since I've given its term a higher boost and that document contains the exact word MBA?

What am I missing here?

Thanks.

+3  A: 

The matching query string makes up about 10% of the content of Doc #3, but only a tiny fraction of #1 and #2.

You might have to adjust your boosts to reflect the different lengths of the alternative query strings.

RichieHindle
Thanks for your comment. Could you please let me know what the final search query should look like?
Ed
@Ed: What I'm saying is that there's another boost factor to take into account, which is the ratio of the length of your alternative query to the original. So because "master business administration" is 10 times longer than "MBA", you should be reducing its boost to something like one tenth: `(description:\"master business administration\"^0.1)`. That will eliminate the bias introduced by its extra length. (It may be that `^0.1` is too much - you'll need to experiment to determine the right relationship between the length ratio and the boost factor needed to offset it.)
RichieHindle
Thanks a ton, Richie! This logic worked for me!
Ed
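The length-ratio adjustment suggested in the comments above can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, not part of Lucene: the helpers `scaled_boost` and `build_query` are hypothetical names, measuring length in characters is an assumption, and the linear ratio is just the starting point the comment proposes, to be tuned by experiment.

```python
def scaled_boost(base_term: str, alternative: str, base_boost: float = 1.0) -> float:
    """Reduce a boost in proportion to how much longer the alternative is.

    Hypothetical helper: length is measured in characters (an assumption).
    """
    ratio = len(alternative) / len(base_term)
    return base_boost / ratio


def build_query(field: str, base_term: str, alternatives: list[str],
                base_boost: float = 3.0) -> str:
    """Assemble a Lucene-style query string with length-adjusted boosts."""
    clauses = [f"({field}:{base_term}^{base_boost})"]
    for alt in alternatives:
        boost = scaled_boost(base_term, alt)
        # Multi-word alternatives are quoted so they are parsed as phrases.
        text = f'"{alt}"' if " " in alt else alt
        clauses.append(f"({field}:{text}^{boost:.2f})")
    return "+(" + " ".join(clauses) + ")"


query = build_query("description", "mba",
                    ["m.b.a.", "master business administration"])
print(query)
```

With these inputs the phrase clause receives a boost of 0.10, matching the `^0.1` suggested above, and `m.b.a.` receives 0.50. Whether those values rank the documents as intended still has to be checked against the real index.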
A: 

If you are using Lucene's StandardAnalyzer, docs #1 and #2 are actually equivalent and both will match the term "mba". It's hard to guess why #3 has the highest score - maybe because it matches multiple terms. You may want to consider handling synonyms like this at index time.

I wouldn't guess that field length is a big factor here, but what you probably want to do is use IndexSearcher.Explain() to get a breakdown of the scoring. That's the best way to debug problems like this one.

KenE
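As a rough check on how much field length could matter for these three documents, the sketch below computes the length normalization used by Lucene's classic TF-IDF similarity, `1 / sqrt(numTerms)`, by hand. It is plain Python rather than a Lucene call, and it makes several assumptions: tokens are counted by whitespace splitting (a real analyzer may drop stop words and punctuation), no index-time boosts are applied, and the index uses the classic similarity rather than BM25. The `Explain()` output remains the authoritative breakdown.

```python
import math

# Shared tail of all three documents from the question.
TAIL = ("you will play an integral role in implementing the strategy of the "
        "business and will have the responsibilities of the statutory accounts, "
        "compliance, audit including banking relationships, tax, treasury & "
        "cash management")

DOCS = {
    1: "As MBA " + TAIL,
    2: "As M.B.A. " + TAIL,
    3: "As Master of Business Administration " + TAIL,
}


def length_norm(text: str) -> float:
    """Classic TF-IDF length norm, 1 / sqrt(number of terms).

    Terms are approximated by whitespace tokens (an assumption).
    """
    num_terms = len(text.split())
    return 1.0 / math.sqrt(num_terms)


norms = {doc_id: length_norm(text) for doc_id, text in DOCS.items()}
for doc_id, norm in norms.items():
    print(f"Document # {doc_id}: {len(DOCS[doc_id].split())} tokens, "
          f"lengthNorm ~ {norm:.4f}")
```

Under these assumptions the three norms differ by only a few percent, and the longest document (#3) gets the smallest norm, so field length alone would not push it to the top.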