views:

336

answers:

5

Lets take StackOverflow questions as example. Each of them has multiple tags assigned. How to build an algorithm that would find related questions based on how many common tags they have (sorted by number of common tags)?

For now I can't think about anything better than just selecting all questions that have at least one common tag into an array and then looping through them all assigning number of common tags to each item, then sorting this array.

Is there more clever way of doing it? Perfect solution would be a single sql query.

+4  A: 

This could be as bad as O(n^2), but it works:

create table QuestionTags (questionid int, tag int);

select q1.questionid, q2.questionid, count(*) as commontags
from QuestionTags q1 join QuestionTags q2 
where q1.tag = q2.tag and q1.questionid < q2.questionid
group by q1.questionid, q2.questionid order by commontags desc;
Keith Randall
Thank you for this, I learned something about SQL from reading this elegant example.
kaleidomedallion
+1  A: 

I didn't have time to optimize the WHERE IN() clause which is slow as death. I also didn't both to include indexes or specify the engine or collations but this should hopefully do the trick. Please point out any mistakes as I have not actually tested this sloppy code:

CREATE TABLE question (
    id INT(11) UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,
    title VARCHAR(255) NOT NULL,
    question MEDIUMTEXT
);

CREATE TABLE question_rel_tag (
    id INT(11) UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,
    question_id INT(11) UNSIGNED NOT NULL,
    tag_id INT(11) UNSIGNED NOT NULL
);

CREATE TABLE tag (
    id INT(11) UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,
    name VARCHAR(20) NOT NULL
);

SELECT question.id, question.title, question.question, count(tag.id) AS `count`
FROM question
INNER JOIN question_rel_tag ON (question.id = question_rel_tag.question_id)
INNER JOIN tag ON (question_rel_tag.tag_id = tag.id)
WHERE question.id != YOUR_QUESTION_ID_HERE
AND tag.id IN 
    (SELECT tag.id FROM question
    INNER JOIN question_rel_tag ON (question.id = question_rel_tag.question_id)
    INNER JOIN tag ON (question_rel_tag.tag_id = tag.id)
    WHERE question.id = YOUR_QUESTION_ID_HERE)
GROUP BY tag.id
ORDER BY `count` DESC
cballou
+1  A: 

Assuming a table Questions with a primary key column Id, a table Tags with a primary key column Id, too, and a join table QuestionTags with a composite primary key QuestionId and TagId referencing the primary keys of the two former tables, the following query will give the desired result (in SQL Server 2005).

SELECT q1.Id AS Id1, q2.Id AS Id2,
   (SELECT COUNT(*) FROM QuestionTags qt1 INNER JOIN QuestionTags qt2
    ON qt1.QuestionId = q1.Id AND qt2.QuestionId = q2.Id AND qt1.TagId = qt2.TagId) AS TagCount
FROM Questions q1 INNER JOIN Questions q2 ON q1.Id < q2.Id
ORDER BY TagCount DESC

This can be improved to the following.

SELECT qt1.QuestionId AS Id1, qt2.QuestionId AS Id2, COUNT(*) AS TagCount
FROM QuestionTags qt1 INNER JOIN QuestionTags qt2
ON qt1.QuestionId < qt2.QuestionId AND qt1.TagId = qt2.TagId
GROUP BY qt1.QuestionId, qt2.QuestionId
ORDER BY TagCount DESC
Daniel Brückner
+1  A: 

Suppose I have the following:

A table of taggable entities:

Taggable

id, stuff

A table of tags:

Tag

id, tagName

A join table of what tags are associated with what Taggables

TagInfo

taggableId tagId

Then :

SELECT COUNT(Taggable_1.id) AS score, Taggable_1.id FROM dbo.Taggable INNER JOIN dbo.TagInfo ON dbo.Taggable.id = dbo.TagInfo.taggableId INNER JOIN dbo.TagInfo TagInfo_1 ON dbo.TagInfo.tagId = TagInfo_1.tagId INNER JOIN dbo.Taggable Taggable_1 ON TagInfo_1.taggableId = Taggable_1.id WHERE (dbo.Taggable.id = 1) AND (Taggable_1.id <> 1) GROUP BY Taggable_1.id

would join the tag-join table against itself (for the queried question id; 1 in the above sql) and count up the results for a score.

Mikeb
+1  A: 
select *
     , count(t.*) as matched_tag_count
  from questions q
  join tags_to_questions t
    on t.question_id = q.id
   and t.tag_id in (select tag_id
                      from tags_to_questions
                     where question_id = ?)
group
    by q.id
order
    by matched_tag_count desc
longneck