Hi!

My search engine uses the following function to calculate relevancy.

private static int calculateScore(String result, String searchStr, int modifier)
{
    String[] resultWords = result.split(" ");
    String[] searchWords = searchStr.split(" ");

    int score = 0;
    // Compare every search word against every word of the result.
    for (String searchWord : searchWords)
    {
        for (String resultWord : resultWords)
        {
            if (resultWord.equals(searchWord))
                score += 10;    // exact match
            else if (resultWord.startsWith(searchWord))
                score += 4;     // prefix match
            else if (resultWord.endsWith(searchWord))
                score += 3;     // suffix match
            else if (resultWord.contains(searchWord))
                score += 1;     // match somewhere inside the word
        }
    }
    return score;
}

Nothing fancy, and I haven't been given enough hours to do anything fancy either. Are there any simple improvements I can make so the function is better at ranking relevant results up and keeping irrelevant ones down? No need to comment on speed optimizations, this is just the "functional part" of the function :)

Thanks.

+9  A: 

Not sure if it counts as fancy, but a Soundex comparison, presumably earning +1 on your scale, would grant a little relevance to typographical near misses and homophones.
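
To make the Soundex idea concrete, here is a minimal sketch of the classic American Soundex code plus a +1 bonus for words that share a code. The class and method names (`SoundexSketch`, `soundexBonus`) are made up for illustration, and it only handles the letters A-Z, so treat it as a starting point rather than a finished implementation.

```java
import java.util.Locale;

final class SoundexSketch {
    // Digit codes for A..Z; '0' marks vowels and H, W, Y, which carry no code.
    private static final String CODES = "01230120022455012623010202";

    // Returns the 4-character Soundex code, or "" if the word has no A-Z letters.
    static String soundex(String word) {
        String s = word.toUpperCase(Locale.ROOT);
        StringBuilder out = new StringBuilder();
        char lastCode = '0';
        for (int i = 0; i < s.length() && out.length() < 4; i++) {
            char c = s.charAt(i);
            if (c < 'A' || c > 'Z') continue;          // skip anything that is not A-Z
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c);                         // first letter is kept as-is
                lastCode = code;
            } else {
                if (code != '0' && code != lastCode) out.append(code);
                if (c != 'H' && c != 'W') lastCode = code; // H and W do not separate duplicates
            }
        }
        if (out.length() == 0) return "";
        while (out.length() < 4) out.append('0');      // pad short codes with zeros
        return out.toString();
    }

    // Hypothetical helper: +1 when both words sound alike, otherwise 0.
    static int soundexBonus(String resultWord, String searchWord) {
        String a = soundex(resultWord);
        String b = soundex(searchWord);
        return !a.isEmpty() && a.equals(b) ? 1 : 0;
    }
}
```

With this, "horse" and "hoarse" both map to H620 and would earn the bonus, while "horse" and "roof" would not. You would probably only add the bonus when none of the stronger rules (equals, startsWith, and so on) already matched.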

I'd suggest using a stop word list to either prevent or radically reduce relevancy granted from common words. If someone is searching for "the horse is on the roof", you want to grant relevancy to "horse" and "roof", not massively upscore everything containing "the".
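
A stop word filter could be sketched roughly like this. The word list is a tiny made-up sample and `StopWordFilter` / `significantWords` are illustrative names; a real list would be longer and specific to your language and data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class StopWordFilter {
    // Illustrative sample only, not a complete stop word list.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "at", "in", "is", "of", "on", "the", "to"));

    // Returns the lowercased search words with stop words removed.
    static List<String> significantWords(String searchStr) {
        List<String> kept = new ArrayList<>();
        for (String word : searchStr.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                kept.add(word);
            }
        }
        return kept;
    }
}
```

For "the horse is on the roof" this keeps only "horse" and "roof", which you would then feed into the scoring loop instead of the raw search words. If you would rather reduce than remove the relevance of stop words, you could score them with a much smaller weight instead of dropping them.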

Another easy boost is to grant a whole bunch of relevancy to a result that contains the entire search string in order (case and punctuation insensitive).
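
One possible way to do the whole-phrase check, assuming you strip punctuation by replacing every run of non-letter, non-digit characters with a single space. `PhraseBoost`, `normalize`, and the `bonus` parameter are names invented for this sketch, and the size of the bonus is up to you.

```java
import java.util.Locale;

final class PhraseBoost {
    // Lowercases the text and collapses punctuation and whitespace into single spaces.
    static String normalize(String text) {
        return text.toLowerCase(Locale.ROOT)
                   .replaceAll("[^\\p{L}\\p{Nd}]+", " ")
                   .trim();
    }

    // Returns bonus if the result contains the whole search string, in order, else 0.
    static int phraseBonus(String result, String searchStr, int bonus) {
        String phrase = normalize(searchStr);
        if (phrase.isEmpty()) return 0;
        // Pad with spaces so the phrase only matches on whole-word boundaries.
        String haystack = " " + normalize(result) + " ";
        return haystack.contains(" " + phrase + " ") ? bonus : 0;
    }
}
```

For example, `phraseBonus("The horse is on the roof!", "horse is ON the roof", 50)` gives 50, while a result containing the same words in a different order gets nothing from this rule.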

chaos
+5  A: 

The classic vector space model is a standard, nice, and rather simple (to implement) solution.

http://www.miislita.com/term-vector/term-vector-3.html

Roman
+1. VSM yields far superior results to ad-hoc solutions.
Emil H
Wow, that made my head spin. I wish I was more mathematically inclined :)
Ace
Actually, this looks pretty cool. I'll make this the accepted answer if someone can explain it to a math-imbecile like me. =)
Ace
+2  A: 

You should also probably normalize case before you do your equals.
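
As a rough sketch, the function from the question with case normalization could look like the following. I have dropped the unused `modifier` parameter and split on any whitespace, both of which are my own changes, and the class name is made up.

```java
import java.util.Locale;

final class CaseInsensitiveScore {
    // Same scoring rules as in the question, but both inputs are lowercased first.
    static int calculateScore(String result, String searchStr) {
        String[] resultWords = result.toLowerCase(Locale.ROOT).split("\\s+");
        String[] searchWords = searchStr.toLowerCase(Locale.ROOT).split("\\s+");

        int score = 0;
        for (String searchWord : searchWords) {
            if (searchWord.isEmpty()) continue;   // every word "contains" the empty string
            for (String resultWord : resultWords) {
                if (resultWord.equals(searchWord)) {
                    score += 10;
                } else if (resultWord.startsWith(searchWord)) {
                    score += 4;
                } else if (resultWord.endsWith(searchWord)) {
                    score += 3;
                } else if (resultWord.contains(searchWord)) {
                    score += 1;
                }
            }
        }
        return score;
    }
}
```

Passing `Locale.ROOT` keeps the lowercasing independent of the machine's default locale (the Turkish dotless i is the usual example of where that matters).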

Steve B.
A: 

Obvious, but you'd want to lowercase everything.

MSalters
+2  A: 

You could add +1 to the score if the Levenshtein Distance is below some threshold.
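
Here is a minimal sketch of the standard dynamic-programming Levenshtein distance with a +1 bonus on top. The names `LevenshteinSketch`, `fuzzyBonus`, and `maxDistance` are made up, and I am assuming "below some threshold" means a distance of at most `maxDistance`, with exact matches excluded because they already score 10.

```java
final class LevenshteinSketch {
    // Number of single-character insertions, deletions, and substitutions
    // needed to turn a into b, computed with two rolling rows.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;                              // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;                              // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,   // insertion
                                            prev[j] + 1),      // deletion
                                   prev[j - 1] + cost);        // substitution or match
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Hypothetical helper: +1 for near misses, 0 for exact matches and distant words.
    static int fuzzyBonus(String resultWord, String searchWord, int maxDistance) {
        int d = distance(resultWord, searchWord);
        return d > 0 && d <= maxDistance ? 1 : 0;
    }
}
```

A threshold of 1 or 2 is probably a sensible place to start for short words. A larger threshold will start matching unrelated words, so you may want to scale it with the word length.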

Matt Kane
+2  A: 

If you want to make your search algorithm a little bit more sophisticated, look at cosine similarity. It's pretty easy to implement and works pretty well in practice.
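
For the less mathematically inclined: treat each text as a bag of word counts, multiply the counts of the words both texts share, and divide by the "lengths" of the two count vectors. The result is 1 when the texts use the same words in the same proportions and 0 when they share no words. Below is a minimal sketch using raw word counts as the vector weights; that choice and the names (`CosineSketch`, `termCounts`) are my assumptions, and a fuller vector space model would typically use tf-idf weights instead.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class CosineSketch {
    // Builds a word -> occurrence count map for a piece of text.
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Cosine of the angle between the two word-count vectors, in the range 0..1.
    static double cosine(String result, String searchStr) {
        Map<String, Integer> a = termCounts(result);
        Map<String, Integer> b = termCounts(searchStr);

        double dot = 0;
        for (Map.Entry<String, Integer> e : b.entrySet()) {
            dot += e.getValue() * a.getOrDefault(e.getKey(), 0);
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0 || normB == 0) ? 0.0 : dot / (normA * normB);
    }

    // Euclidean length of a count vector.
    private static double norm(Map<String, Integer> vector) {
        double sum = 0;
        for (int count : vector.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }
}
```

Searching for "horse", the result "horse horse roof" scores about 0.89 while "horse roof barn shed" scores 0.5, so short results that are mostly about the search term rank above long ones that merely mention it.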

neesh
AKA the vector space model.
erickson
+1  A: 

You could:

  • Cull out noise words (the, a, at, etc.)
  • Change weight for searchable text relevancy - if a search word occurs in a document's title and its body, the title is probably more relevant.
  • Depending on your resultWord text, consider adding weight for words earlier or later in the text.
  • Make unique search word matches worth more (you'd have to determine what makes a word unique).
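
Two of the points above can be sketched briefly. For "unique" words I am assuming inverse document frequency (words that occur in few documents weigh more), which is one common definition and not necessarily the one meant here. The title weighting takes a hypothetical `titleWeight` multiplier. All names are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class WeightingSketch {
    // Inverse document frequency: the fewer documents contain the word,
    // the larger the weight. The +1 terms avoid division by zero.
    static double idf(String word, List<String> documents) {
        String needle = word.toLowerCase(Locale.ROOT);
        int containing = 0;
        for (String doc : documents) {
            Set<String> words = new HashSet<>(
                    Arrays.asList(doc.toLowerCase(Locale.ROOT).split("\\s+")));
            if (words.contains(needle)) {
                containing++;
            }
        }
        return Math.log((1.0 + documents.size()) / (1.0 + containing)) + 1.0;
    }

    // Counts exact word matches, with title hits multiplied by titleWeight.
    static int fieldScore(String title, String body, String searchWord, int titleWeight) {
        return titleWeight * count(title, searchWord) + count(body, searchWord);
    }

    // Case-insensitive count of whole-word occurrences.
    private static int count(String text, String word) {
        String needle = word.toLowerCase(Locale.ROOT);
        int n = 0;
        for (String w : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (w.equals(needle)) {
                n++;
            }
        }
        return n;
    }
}
```

You could multiply each match score by `idf(searchWord, allDocuments)` so that a hit on "horse" counts for more than a hit on a word that appears everywhere. In practice the document frequencies would be computed once up front, not on every call as in this sketch.
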
Corbin March