Hi there,

I am trying to change how term searches against my Lucene index are scored. Currently the score depends on the number of times the term appears in a document. What I would like is to score on whether the term exists at all, regardless of how often it appears, so that a document containing the term once scores the same as a document containing it 100 times.

I've tried extending Zend_Search_Lucene_Search_Similarity with my own class, but to be honest I am not sure whether it is working correctly, as the scores are still quite low.

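class MySimilarity extends Zend_Search_Lucene_Search_Similarity {

    // Ignore term frequency: every document containing the term gets the same factor.
    public function tf($freq) {
        return 1.0;
    }

    // Shorter fields score higher. This value is computed when the document is indexed.
    public function lengthNorm($fieldName, $numTerms) {
        return 1.0 / sqrt($numTerms);
    }

    // Scales all scores for a query by the same factor; it does not change the ranking.
    public function queryNorm($sumOfSquaredWeights) {
        return 1.0 / sqrt($sumOfSquaredWeights);
    }

    // Phrase matches score the same however far apart the terms are.
    public function sloppyFreq($distance) {
        return 1.0;
    }

    // Terms that occur in fewer documents weigh more.
    public function idfFreq($docFreq, $numDocs) {
        return log($numDocs / (float) ($docFreq + 1)) + 1.0;
    }

    // Fraction of the query terms that the document matches.
    public function coord($overlap, $maxOverlap) {
        return $overlap / (float) $maxOverlap;
    }
}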

This is built from examples I found by searching Google. The only real change I've made is to the tf() function.

I would be really grateful for any help with this, as at the moment it's really messing up my searches.

Thanks,

Grant

A: 

I would try two things to debug this:

  1. Build a really small index - two documents, a single field in each, the first having the word "boat", and the second the phrase "boat boat". Test your search on that.
  2. Try to override only the tf() function. This is the change you want. Overriding other parts, such as the norm, requires reindexing using the new similarity function. Make sure you actually need this before reindexing.
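
The two-document test above can be worked through by hand first. The following is a minimal arithmetic sketch, not a call into Zend_Search_Lucene: it assumes the default tf is sqrt($freq) (as in Java Lucene's default similarity), uses the lengthNorm formula from the question, and ignores the factors that are identical for both documents in a single-term query (idf, queryNorm, coord). The helper names are hypothetical.

```php
<?php
// Stand-alone arithmetic sketch; it does not use Zend_Search_Lucene.
// All function names here are hypothetical, chosen for illustration.

// Assumed default term-frequency factor: sqrt(freq).
function defaultTf($freq) {
    return sqrt($freq);
}

// Constant term-frequency factor, as in the question's tf() override.
function constantTf($freq) {
    return $freq > 0 ? 1.0 : 0.0;
}

// Length normalisation as written in the question: 1 / sqrt(number of terms).
function lengthNorm($numTerms) {
    return 1.0 / sqrt($numTerms);
}

// The two test documents: occurrences of "boat" and total terms in the field.
$docs = array(
    'boat'      => array('freq' => 1, 'numTerms' => 1),
    'boat boat' => array('freq' => 2, 'numTerms' => 2),
);

$results = array();
foreach ($docs as $text => $doc) {
    $norm = lengthNorm($doc['numTerms']);
    $results[$text] = array(
        'default'  => defaultTf($doc['freq']) * $norm,   // tf * lengthNorm with sqrt tf
        'constant' => constantTf($doc['freq']) * $norm,  // tf * lengthNorm with tf = 1.0
    );
    printf("%-10s default=%.3f constant=%.3f\n",
        $text, $results[$text]['default'], $results[$text]['constant']);
}
```

Under these assumptions, a constant tf leaves "boat" at 1.000 and "boat boat" at about 0.707, and the whole gap comes from lengthNorm. So if the two test documents still score differently, that alone does not show the tf() override was ignored. A real index may also store the norm in rounded form, so the actual numbers can differ slightly.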

Overall, changing the tf() function seems the right thing to do, provided that you only want a relative order and do not care about the absolute score.

Yuval F
What would be the best way of getting the absolute score? Would it be idfFreq()? Thanks, Grant
Grant Collins
Why do you need the absolute score? I suggest you read http://lucene.apache.org/java/2_4_0/scoring.html and http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html. Java Lucene has a handy explain() function that describes why a document got its score. I couldn't find one in Zend, but you may have better luck. Anyway, for search you only need the proper order of documents, so the relative score is the one that matters.
Yuval F
Thanks, Yuval, those documents pointed me in the right direction.
Grant Collins