Hello

I am using SQL Server 2008 Full-Text Search and joining to FREETEXTTABLE to rank the results.

How do I determine whether the result set is giving an accurate match or not?

For example, for one search I may get these results:

Manufacturer | Rank
===================

LG U300 ------- 102
LG C1100 ------ 54
LG GT505 ------ 18
LG KF300 ------ 18
LG Callisto --- 18
...

The spread of the rank range suggests that one result is overwhelmingly more relevant than all of the other results, indicating that the top result is most likely an accurate match for the search term.

But for another search I may get this result:

Manufacturer | Rank
===================

LG C1100 ------ 33
LG GC900 ------ 31
LG GT500 ------ 31
LG KC910 ------ 31
LG KF310 ------ 31
...

The lack of spread in the rank range of this result set indicates an inaccurate search result.

How can I output a boolean value as an extra column in the results that indicates whether the spread of the rank suggests that results are accurate or not?

Thank you!

A: 

You could of course use the variance as an indicator of "spread", but I don't think this is the right approach, especially if you look at only the first n results.

Relevance is a big topic in information retrieval. It depends on the ranking method, on how likely a search term is to occur at all, and on the relevance of the other search terms. Something you could do:

Calculate the expected number of occurrences (the mean count) of a search term in a random document, then compare it with the number of occurrences in each returned result. Your ranking is then counts-in-my-doc / mean-count, and a document is relevant if this ratio is significantly higher than 1.
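As a rough illustration of that ratio, here is a minimal Python sketch rather than anything SQL Server provides. The function names, the idea of passing in pre-computed term counts, and the `min_ratio` cutoff of 2.0 are all assumptions made for the example; what counts as "significantly higher than 1" would have to be tuned on real data.

```python
def relevance_ratio(count_in_doc, total_occurrences, num_docs):
    """Return counts-in-my-doc divided by the mean count per document.

    total_occurrences is the number of times the term appears across
    the whole collection; num_docs is the collection size.
    """
    if num_docs <= 0:
        raise ValueError("num_docs must be positive")
    mean_count = total_occurrences / num_docs
    if mean_count == 0:
        # The term never occurs in the collection, so there is no signal.
        return 0.0
    return count_in_doc / mean_count


def is_relevant(count_in_doc, total_occurrences, num_docs, min_ratio=2.0):
    """Hypothetical cutoff: treat a ratio above min_ratio as relevant."""
    return relevance_ratio(count_in_doc, total_occurrences, num_docs) > min_ratio


# A term that appears 200 times across 100 documents has a mean count of 2.
# A document containing it 6 times scores 6 / 2 = 3.
ratio = relevance_ratio(6, 200, 100)
```

The counts themselves would have to come from your own data; the rank returned by the full-text engine is not a raw term count.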

bayer
A: 

Calculate the percentage difference between the top-ranked value and the median value of the result set. The bigger the difference, the more likely it is that the top match is accurate.

For the first result set: (102 - 18) / 102 = 82.35%. For the second: (33 - 31) / 33 = 6.06%.

Then set a baseline in the code, e.g. if the spread is greater than 40%, row 1 likely contains an accurate result. Run tests on various searches to determine the baseline value.
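The calculation can be sketched in a few lines of Python, which should port to T-SQL or to application code. The function names are made up for the example, the 40% baseline is only the illustrative figure from above, and the sample lists contain just the rows visible in the question (the full result sets are truncated there), so treat this as a sketch under those assumptions.

```python
from statistics import median


def rank_spread(ranks):
    """Return (top - median) / top for a list of rank values."""
    if not ranks:
        return 0.0
    top = max(ranks)
    if top <= 0:
        # Avoid dividing by zero when every rank is 0.
        return 0.0
    return (top - median(ranks)) / top


def top_result_is_accurate(ranks, baseline=0.40):
    """Flag the top row as a likely match if the spread exceeds the baseline."""
    return rank_spread(ranks) > baseline


# Visible rows from the two result sets in the question.
first = [102, 54, 18, 18, 18]
second = [33, 31, 31, 31, 31]

first_flag = top_result_is_accurate(first)    # spread is 84 / 102
second_flag = top_result_is_accurate(second)  # spread is 2 / 33
```

The boolean is the same for every row of a given search, so it can be computed once per result set and attached as the extra column.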