How to build a 'related questions' engine?

+4 Q:

One of our bigger sites has a section where users can send questions to the website owner, which his staff evaluate personally. When the same question comes up very often, they can add it to the FAQ.

To keep them from receiving dozens of similar questions a day, we would like to provide a feature similar to the 'Related questions' on this site (Stack Overflow).

What ways are there to build this kind of feature? I know that I should somehow evaluate the question and compare it to the questions in the FAQ, but how does this comparison work? Are keywords extracted, and if so, how?

It might be worth mentioning that this site is built on the LAMP stack, so these are the technologies available.

Thanks!

+2 A:

I don't know how Stack Overflow works, but I guess that it uses the tags to find related questions. For example, on this question the top few related questions all have the tag recommendation-engine. I would guess that matches on rarer tags count for more than matches on common tags.
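As a rough illustration of that guess (not a description of how Stack Overflow actually works), the sketch below scores each existing question by the tags it shares with the new one and gives rarer tags a larger weight. The function names, the `faq` data and the logarithmic weighting are all assumptions made for the example.

```python
import math
from collections import Counter


def tag_weights(tagged_questions):
    """Weight each tag by log(N / number of questions carrying that tag)."""
    total = len(tagged_questions)
    counts = Counter(
        tag for tags in tagged_questions.values() for tag in set(tags)
    )
    return {tag: math.log(total / count) for tag, count in counts.items()}


def related_by_tags(query_tags, tagged_questions, top_n=3):
    """Return ids of the questions sharing the most heavily weighted tags."""
    weights = tag_weights(tagged_questions)
    scored = []
    for question_id, tags in tagged_questions.items():
        shared = set(query_tags) & set(tags)
        score = sum(weights.get(tag, 0.0) for tag in shared)
        if score > 0:
            scored.append((score, question_id))
    # Highest score first; ties are broken by question id for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [question_id for _, question_id in scored[:top_n]]


# Hypothetical data: question id -> tags.
faq = {
    "q1": ["php", "mysql"],
    "q2": ["php", "recommendation-engine"],
    "q3": ["php", "sessions"],
    "q4": ["mysql", "recommendation-engine"],
}

print(related_by_tags(["php", "recommendation-engine"], faq))
```

Here "php" appears on three of the four questions, so a match on it counts for less than a match on "recommendation-engine", which appears on only two.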

You might also want to look at term frequency–inverse document frequency.

Mark Byers 2010-02-02 08:26:51

And probably the size of the intersection between the sets of tags.

jensgram 2010-02-02 08:29:45

You can use spell-checking, where the corpus is the titles/text of the existing FAQ entries:

http://stackoverflow.com/questions/41424/how-do-you-implement-a-did-you-mean/258290#258290

Will 2010-02-02 08:29:56

+2 A:

If you wanted to build something like this yourself from scratch, you'd use something called TF-IDF: term frequency / inverse document frequency. That means, to simplify it enormously, you find words in the query that are uncommon in the corpus as a whole and find documents that have those words.

In other words, if someone enters a query with the words "I want to buy an elephant" in it, then of the words in the query, the word "elephant" is probably the least common word in your corpus. "Buy" is probably next. So you rank documents (in your case, previous queries) by how much they contain the word "elephant" and then how much they contain the word "buy". The words "I", "to" and "an" are probably in a stop-list, so you ignore them altogether. You rank each document (previous query, in your case) by how many matching words there are (weighting according to inverse document frequency -- i.e. high weight for uncommon words) and show the top few.
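A minimal sketch of that ranking, assuming a tiny in-memory corpus: the stop-list, the tokenizer, the sample questions and the plain `tf * log(N / df)` weighting are all illustrative choices, not a tuned implementation.

```python
import math
import re
from collections import Counter

# Hypothetical stop-list; a real one would be much longer.
STOP_WORDS = {
    "i", "to", "an", "a", "the", "is", "do", "you", "how", "can",
    "what", "are", "your", "where",
}


def tokenize(text):
    """Lowercase the text, split it into words and drop stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def rank_related(query, documents, top_n=3):
    """Rank documents by the summed TF-IDF weight of the query words they contain."""
    doc_tokens = [tokenize(doc) for doc in documents]
    total = len(documents)
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for tokens in doc_tokens for word in set(tokens))
    query_words = set(tokenize(query))
    scored = []
    for index, tokens in enumerate(doc_tokens):
        tf = Counter(tokens)
        score = sum(
            tf[word] * math.log(total / df[word])
            for word in query_words
            if word in tf
        )
        if score > 0:
            scored.append((score, index))
    # Highest score first; ties keep the original document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [documents[index] for _, index in scored[:top_n]]


previous_questions = [
    "Where can I buy an elephant?",
    "How do I buy a gift card?",
    "Can I buy tickets online?",
    "What are your opening hours?",
]

for match in rank_related("I want to buy an elephant", previous_questions):
    print(match)
```

With this sample data, "elephant" occurs in one question and "buy" in three, so the elephant question ranks first and the question about opening hours, which shares no words with the query, is left out.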

I've oversimplified, and you'd need to read up on this to get it right, but it's really not terribly complicated to implement in a simple way. The Wikipedia page might be a good place to start:

http://en.wikipedia.org/wiki/Tf%E2%80%93idf

Ben 2010-02-02 08:48:02

+1 A:

Given that you're working with a LAMP stack, you should be able to make good use of MySQL's full-text search functions. I believe they work on TF-IDF principles, and they should make it pretty easy to create the 'related questions' feature that you want.
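As a hedged sketch of what such a query could look like, the helper below only builds the SQL string and its parameters. It assumes a hypothetical `faq` table with a FULLTEXT index on `(title, body)` and a database driver that uses `%s` placeholders; the same `MATCH ... AGAINST` statement could be issued from PHP.

```python
def build_related_query(question_text, limit=5):
    """Build a parameterized MySQL full-text query for related FAQ entries.

    Assumes a hypothetical table `faq` with a FULLTEXT index on (title, body).
    """
    sql = (
        "SELECT id, title, "
        "MATCH(title, body) AGAINST (%s IN NATURAL LANGUAGE MODE) AS score "
        "FROM faq "
        "WHERE MATCH(title, body) AGAINST (%s IN NATURAL LANGUAGE MODE) "
        "ORDER BY score DESC "
        "LIMIT %s"
    )
    # The question text is passed twice: once for the score column and
    # once for the WHERE clause that filters out non-matching rows.
    return sql, (question_text, question_text, int(limit))


sql, params = build_related_query("I want to buy an elephant", limit=3)
print(sql)
print(params)
```

Passing the user's text as a parameter rather than concatenating it into the statement avoids SQL injection.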

Vex 2010-02-02 10:29:59

+1 A:

There's a great O'Reilly book, Programming Collective Intelligence, which covers group discovery, recommendations and other similar topics. The examples are in Python, but I found it easy to understand coming from a PHP background, and within a few hours I had built something akin to what you're after.

Yahoo has a keyword extractor webservice at http://developer.yahoo.com/search/content/V1/termExtraction.html

adam 2010-02-02 10:49:00
