Generally, for this kind of many-to-many relationship, there are three tables :
- The "
article
" table
- The "
tag
" table
- primary key = id
- contains the data of each tag :
- A "
tags_articles
" table, which acts as a join table, and contains only :
id_article
: foreign key that points to an article
id_tag
: foreign key that points to a tag
This way, there is no duplication of any tag's data : for each tag, there is one, and only one, line in the tag
table.
And, for each article, you can have several tags (i.e. several lines in the tags_articles
table) ; and, of course, for each tags, you can have several articles.
Getting a list of tags for an article, with this idea, is a matter of an additionnal query, like :
select tag.*
from tag
inner join tags_articles on tag.id = tags_articles.id_tag
where tags_articles.id_article = 123
Getting the three "most similar" articles would mean :
- select articles that have tags that the first article has
- only use those which have the most important number of identical tags
Not tested, but an idea might be something that would look like this :
select article.id, count(*) as nb_identical_tags
from article
inner join tags_articles on tags_articles.id_article = article.id
inner join tag on tag.id = tags_articles.id_tag
where tag.name in ('php', 'mysql', 'erlang')
and article.id <> 123
group by article.id
order by count(*) desc
limit 3
Basically, you :
- select the articles ids for each tag that's present on your initial article
- as there's an inner join, if an article in the DB has 2 tags that match the
where
clause, without the group by
clause, there would be two lines for that article
- of course, you don't want to re-select the article you already had -- which means it has to be excluded.
- but, as you use
group by article.id
, there will be only one line per article
- but you'll be able to use
count
, to find out how many tags each article has in common with the initial one
- then, it's only a matter of sorting per number of tags, and getting only the third three lines.