I was about to ask a question earlier today when I came across a surprising feature of Stack Overflow. As I typed my question title, Stack Overflow suggested several related questions, and I found that there were already two similar ones. That was stunning!

Then I started thinking about how I would implement such a feature, and how I would order questions by relatedness:

  1. Questions that have a higher number of word matches with the new question rank first
  2. If the number of matches is the same, the order of the words is considered
  3. Words that appear in the title have higher relevancy
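
The three criteria above can be sketched roughly as follows. This is only one possible reading, not how Stack Overflow does it: the names `tokenize`, `lcs_length`, `rank` and the `title_weight` parameter are hypothetical, each question is assumed to be a dict with `title` and `body` keys, and "order of words" is approximated by the longest common subsequence between the new title and the existing title.

```python
import re

def tokenize(text):
    # Lowercase the text and split it into simple word tokens.
    return re.findall(r"[a-z0-9']+", text.lower())

def lcs_length(a, b):
    # Length of the longest common subsequence of two word lists;
    # higher means more shared words appearing in the same order.
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rank(new_title, questions, title_weight=2):
    query = tokenize(new_title)
    query_set = set(query)
    scored = []
    for q in questions:
        title_words = tokenize(q["title"])
        title_hits = query_set & set(title_words)
        body_hits = (query_set & set(tokenize(q["body"]))) - title_hits
        # Criteria 1 and 3: count matches, with title matches weighted higher.
        match_score = title_weight * len(title_hits) + len(body_hits)
        # Criterion 2: word order is used only as a tie-breaker.
        order_score = lcs_length(query, title_words)
        scored.append((match_score, order_score, q["title"]))
    scored.sort(key=lambda s: (s[0], s[1]), reverse=True)
    return [title for _, _, title in scored]
```

For example, given the new title "sort python list", an existing question titled "sort a python list" would be placed ahead of one titled "python list sort": both match the same three words, but the first keeps them in the same order.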

Would that be a simple workflow or a complex scoring algorithm? Some stemming to increase recall, maybe? Is there a library that implements this function? What other aspects would you consider? Maybe Jeff could answer this himself! How did you implement this in Stack Overflow? :)

A: 

Isn't StackOverflow going to be open sourced at some point? If so, you can always find out how they did it there.

Update: It appears that they say they might open source it. I hope they do.

Paul Tomblin
+3  A: 

One way to implement such an algorithm is to rank the questions with a heuristic function that assigns a 'relevance' weight using the following steps:

  1. Apply a noise filter to the 'New' question to remove words that are common across a large number of questions, such as 'the', 'and', 'or', etc.
  2. Get the number of words contained in the 'New' question which match the words in each of the questions already posted on the website. [A]
  3. Get the number of tag matches between the words in the 'New' question and the available tags. [B]
  4. Compute the 'relevance weight' from [A] and [B] as 'x[A] + y[B]', where x and y are weight multipliers (assign the higher multiplier to [B], as a tag match is more relevant than a simple word match)
  5. Get the top 5 questions which have the highest 'relevance weight'.

The heuristic might require tweaking to get optimal results, but it should work.
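
As a rough illustration of the steps above, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `STOP_WORDS` set, the helper names `keywords` and `related`, the default multipliers `x=1.0` and `y=3.0`, and the shape of each stored question (a dict with `text` and `tags`). It does no stemming, so 'sort' and 'sorting' are treated as different words.

```python
import re

# Illustrative noise words only; a real filter would use a much larger list.
STOP_WORDS = {"the", "and", "or", "a", "an", "to", "in", "of", "is", "how", "i"}

def keywords(text):
    # Step 1: lowercase, tokenize, and drop the noise words.
    tokens = re.findall(r"[a-z0-9#+]+", text.lower())
    return {t for t in tokens if t not in STOP_WORDS}

def related(new_question, existing, x=1.0, y=3.0, limit=5):
    words = keywords(new_question)
    scored = []
    for q in existing:
        a = len(words & keywords(q["text"]))               # step 2: word matches [A]
        b = len(words & {t.lower() for t in q["tags"]})    # step 3: tag matches [B]
        weight = x * a + y * b                             # step 4: x[A] + y[B]
        if weight > 0:
            scored.append((weight, q["text"]))
    # Step 5: keep the questions with the highest relevance weight.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]
```

The multipliers `x` and `y` are the obvious knobs to tweak: raising `y` makes a shared tag count for more than several shared plain words.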

Pascal
A: 

Thanks Pascal! Do you know of any API that implements this in a configurable way? Does Lucene provide this functionality?

Marcio Aguiar
A: 

@marcio

Sorry, I am not aware of any API that I could suggest here, and I have never worked with Lucene.

However, I am aware that Google Desktop uses a Query API to rank and suggest the relevant search results. More information on the API can be found here.

Perhaps others could chime in and guide you.

Pascal
+1  A: 

Your question seems similar to this one, which has some additional answers.

robaker