views:

393

answers:

4

We may tag a question with multiple tags in StackOverflow website, I'm wondering how to find out the most related questions with common tags.

Assume we have 100 questions in a database, each question has several tags. Let's say user is browsing a specific question, and we want to make the system to display the related questions on the page. The criteria for related question is they have most common tags.

For example: Question 1 is tagged with AAA, BBB, CCC, DDD, EEE.

Question 2 is top 1 related because it also has all those 5 tags. Question 3 is top 2 related because it has only 4 or 3 tags that Questio1 has. ......

So my question is how to design the database and find out the questions that's related to Question 1 quickly. Thank you very much.

A: 

Not entirely sure what you mean, but the Tags page lists tags in order of popularity (as in amount tagged).

Edit: is this about SO or about your own application? If it is about your own app, remove the SO tag as it's kind of misleading.

Edit2: I'd say something like:

SELECT * FROM `questions` WHERE `tag` LIKE '%tagname%' OR (looped for each tag) LIMIT 5,0

Where 5 is the maximum results you want to return (for at least some optimisation). Probably not the best solution, but I could see it working.

You might also want to try a LIKE match using the title.

Ross
A: 

Sorry for the confusion. I'm asking the question for my own application.

Assume we have 100 questions in a database, each question has several tags. Let's say user is browsing a specific question, and we want to make the system to display the related questions on the page. The criteria for related question is they have most common tags.

For example:
Question 1 is tagged with AAA, BBB, CCC, DDD, EEE.

Question 2 is top 1 related because it also has all those 5 tags.
Question 3 is top 2 related because it has only 4 or 3 tags that Questio1 has.
......

So my question is how to design the database and find out the questions that's related to Question 1 quickly. Thank you very much.

+6  A: 

Perhaps something like:

select qt.question_id, count(*)
from   question_tags qt
where  qt.tag in
( select qt2.tag
  from   question_tags qt2
  where  qt2.question_id = 123
)
group by qt.question_id
order by 2 desc
Tony Andrews
+2  A: 

If you can guarantee that there are not duplicate tags for a question, then you can do the following:

SELECT
     QT2.question_id,
     COUNT(*) AS cnt
FROM
     Question_Tags QT1
INNER JOIN Question_Tags QT2 ON QT2.tag = QT1.tag AND QT2.question_id <> QT1.question_id
WHERE
     QT1.question_id = @question_id
GROUP BY
     QT2.question_id
ORDER BY
     cnt DESC

If you can't guarantee uniqueness of tags within a question, then Tony Andrews' solution will work. His will work in any event, but you should compare performance on your system with this method if you can make the guarantee of uniqueness through constraints.

Tom H.