My question is: how do I compare a group of tags to another post's tags in my database to get related posts?

To start off, I've searched the internet and Stack Overflow to find out how to do this. The closest I found was this post: http://stackoverflow.com/questions/2153062/how-to-find-related-items-in-php. But it doesn't actually solve much for me.

What I'm trying to do is compare the whole group of tags on one post to another post's tags, rather than comparing each tag individually. The goal is to get truly related items based on a post's tags and show them from most related to least related. Three related items must always be shown, no matter how strong the relationship is.

Post A has the tags: "architecture", "wood", "modern", "switzerland"
Post B has the tags: "architecture", "wood", "modern"
Post C has the tags: "architecture", "modern", "stone"
Post D has the tags: "architecture", "house", "residence"

Post B is related to post A by 75% (3 related tags)
Post C is related to post A by 50% (2 related tags)
Post D is related to post A by 25% (1 related tag)
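
To pin down what those percentages mean, here is a minimal sketch in Python rather than SQL. It assumes "related by N%" means the share of post A's tags that the other post also has; the names `posts`, `relatedness` and `related_to` are made up for illustration.

```python
# Tag sets copied from the example above.
posts = {
    "A": {"architecture", "wood", "modern", "switzerland"},
    "B": {"architecture", "wood", "modern"},
    "C": {"architecture", "modern", "stone"},
    "D": {"architecture", "house", "residence"},
}

def relatedness(source, target):
    """Percentage of the source post's tags that the target post also has.

    Assumes the source post has at least one tag.
    """
    return 100 * len(source & target) / len(source)

def related_to(name, limit=3):
    """Return (post, percentage) pairs for other posts, most related first."""
    scores = [(other, relatedness(posts[name], tags))
              for other, tags in posts.items() if other != name]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:limit]

print(related_to("A"))  # [('B', 75.0), ('C', 50.0), ('D', 25.0)]
```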

How can I do that? I'm currently using three tables:

posts
> id
> image
> date

post_tags
> post_id
> tag_id

tags
> id
> name
+1  A: 

NOTE: This solution is MySQL only, as MySQL has its own interpretation of GROUP BY

I've also used my own calculation of similarity: the number of identical tags divided by the average tag count of post A and post B. So if post A has 4 tags, and post B has 2 tags which are both shared with A, the similarity is about 67% (2/3).

(SHARED:2) / ((A:4 + B:2) / 2) or (SHARED:2) / (AVG:3)

It should be easy to change the formula if you want/need to...
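
As a quick sanity check of that formula outside the database, here is a small Python sketch. The function name `similarity` is only illustrative, and duplicate tags on a post are assumed not to count twice.

```python
def similarity(source_tags, target_tags):
    """Shared tag count divided by the average tag count of the two posts."""
    source, target = set(source_tags), set(target_tags)
    shared = len(source & target)
    average = (len(source) + len(target)) / 2
    # Two posts without any tags have nothing in common.
    return shared / average if average else 0.0

a = ["architecture", "wood", "modern", "switzerland"]
b = ["architecture", "wood"]
print(round(similarity(a, b), 2))  # 2 / ((4 + 2) / 2) -> 0.67
```

Swapping in a different formula (for example dividing by the source post's tag count only) would just mean changing the `average` line.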

SELECT
 sourcePost.id,
 targetPost.id,

 /* COUNT NUMBER OF IDENTICAL TAGS */
 /* REF GROUPING OF sourcePost.id and targetPost.id BELOW */
 COUNT(targetPost.id) /
 (
  (
   /* TOTAL TAGS IN SOURCE POST */
   (SELECT COUNT(*) FROM post_tags WHERE post_id = sourcePost.id)

   +

   /* TOTAL TAGS IN TARGET POST */
   (SELECT COUNT(*) FROM post_tags WHERE post_id = targetPost.id)

  ) / 2  /* AVERAGE TAGS IN SOURCE + TARGET */
 ) as similarity
FROM
 posts sourcePost
LEFT JOIN
 post_tags sourcePostTags ON (sourcePost.id = sourcePostTags.post_id)
INNER JOIN
 post_tags targetPostTags ON (sourcePostTags.tag_id = targetPostTags.tag_id
                             AND 
                              sourcePostTags.post_id != targetPostTags.post_id)
LEFT JOIN
 posts targetPost ON (targetPostTags.post_id = targetPost.id)
GROUP BY
 sourcePost.id, targetPost.id
Ivar Bonsaksen
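
For anyone who wants to try the idea on one post at a time, here is a rough, self-contained sketch. It is not the query above: it assumes SQLite (through Python's sqlite3 module) instead of MySQL, filters on a single source post, and adds ORDER BY and LIMIT 3. The sample ids 1 to 4 stand in for posts A to D from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, image TEXT, date TEXT);
    CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE post_tags (post_id INTEGER, tag_id INTEGER);
""")

# Sample data taken from the question (1 = post A, 2 = B, 3 = C, 4 = D).
sample = {
    1: ["architecture", "wood", "modern", "switzerland"],
    2: ["architecture", "wood", "modern"],
    3: ["architecture", "modern", "stone"],
    4: ["architecture", "house", "residence"],
}
tag_ids = {}
for post_id, names in sample.items():
    conn.execute("INSERT INTO posts (id) VALUES (?)", (post_id,))
    for name in names:
        if name not in tag_ids:
            cursor = conn.execute("INSERT INTO tags (name) VALUES (?)", (name,))
            tag_ids[name] = cursor.lastrowid
        conn.execute("INSERT INTO post_tags VALUES (?, ?)",
                     (post_id, tag_ids[name]))

# shared * 2.0 / (source_count + target_count) equals
# shared / ((source_count + target_count) / 2); the 2.0 also avoids
# SQLite's integer division.
RELATED_SQL = """
    SELECT target.post_id,
           COUNT(*) * 2.0 / (
               (SELECT COUNT(*) FROM post_tags WHERE post_id = source.post_id)
             + (SELECT COUNT(*) FROM post_tags WHERE post_id = target.post_id)
           ) AS similarity
    FROM post_tags AS source
    JOIN post_tags AS target
      ON target.tag_id = source.tag_id
     AND target.post_id != source.post_id
    WHERE source.post_id = ?
    GROUP BY source.post_id, target.post_id
    ORDER BY similarity DESC
    LIMIT 3
"""
related = conn.execute(RELATED_SQL, (1,)).fetchall()
print([post_id for post_id, _ in related])  # [2, 3, 4]
```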
Ivar, I really appreciate your help; this works very nicely. I've changed the grouping slightly to organize the results better. The one thing I'm wondering is whether there is some way, if there are fewer than 3 results, to fill in with a random set of items from the database?
stwhite
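
One possible way to handle the "fewer than 3 results" case is to run a second query that pads the list with random posts. The sketch below is only an assumption about how that could look: it uses SQLite via Python's sqlite3 (where the random ordering is `ORDER BY RANDOM()`; MySQL spells it `ORDER BY RAND()`), ranks by raw shared-tag count to stay short, and the helper name `related_with_fallback` is made up.

```python
import sqlite3

def related_with_fallback(conn, post_id, limit=3):
    """Return up to `limit` post ids: tag matches first, then random filler."""
    rows = conn.execute(
        """
        SELECT target.post_id
        FROM post_tags AS source
        JOIN post_tags AS target
          ON target.tag_id = source.tag_id
         AND target.post_id != source.post_id
        WHERE source.post_id = ?
        GROUP BY target.post_id
        ORDER BY COUNT(*) DESC
        LIMIT ?
        """,
        (post_id, limit),
    ).fetchall()
    related = [row[0] for row in rows]

    missing = limit - len(related)
    if missing > 0:
        # Exclude the source post and anything already picked.
        exclude = related + [post_id]
        placeholders = ",".join("?" * len(exclude))
        filler = conn.execute(
            f"SELECT id FROM posts WHERE id NOT IN ({placeholders}) "
            "ORDER BY RANDOM() LIMIT ?",
            (*exclude, missing),
        ).fetchall()
        related += [row[0] for row in filler]
    return related

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY);
    CREATE TABLE post_tags (post_id INTEGER, tag_id INTEGER);
    INSERT INTO posts (id) VALUES (1), (2), (3), (4), (5);
    -- Only post 2 shares a tag (tag 10) with post 1.
    INSERT INTO post_tags VALUES (1, 10), (1, 11), (2, 10), (3, 20), (4, 21);
""")
result = related_with_fallback(conn, 1)
print(result)  # post 2 first, then two random posts from 3, 4, 5
```

Doing the padding in application code keeps the related-posts query simple; a single-query variant with UNION would also be possible but is harder to read.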
@Ivar Bonsaksen: Any solution for PostgreSQL?
takeshin
A: 

Put each post's tags into an array (one array per post: Post A, Post B, etc.). Then use array_diff_assoc() to figure out how different the arrays are.

But really, Ivar's solution would work better; this one is just easier to understand :)

James Eggers
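
One caveat worth illustrating: PHP's array_diff_assoc() also compares array indexes, so the same tag stored at a different position counts as a difference. The Python sketch below mimics that index-aware behaviour for plain lists and contrasts it with an order-independent comparison (in PHP, array_intersect() would be the rough equivalent). The function names are illustrative only.

```python
def positional_diff(a, b):
    """Items of `a` that are not found at the same index in `b`."""
    return [tag for i, tag in enumerate(a) if i >= len(b) or b[i] != tag]

def shared_tags(a, b):
    """Tags present in both lists, regardless of position."""
    return set(a) & set(b)

post_a = ["architecture", "wood", "modern", "switzerland"]
post_c = ["architecture", "modern", "stone"]

# "modern" is in both posts but at different positions, so the
# index-aware comparison still reports it as different.
print(positional_diff(post_a, post_c))      # ['wood', 'modern', 'switzerland']
print(sorted(shared_tags(post_a, post_c)))  # ['architecture', 'modern']
```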
This is definitely easier to understand and is my fallback solution if I need one. What I'm trying to work out now: if fewer than 3 results are returned, is there a way (in the query) to get a random set of items from the database?
stwhite