views:

69

answers:

3

I'm trying to find the most efficient way of dealing with this but I must tell you front-head I've made a mess of it. Looked around SO and found nothing of relevance so here it goes.

How to select all projects that have similar tags to the desired project?

Take this table for example:
(sql code to recreate tables bellow)

project 1 -> tagA | tagB | tagC
project 2 -> tagA | tagB
project 3 -> tagA
project 4 -> tagC

Selecting project 1 should return back all projects.
Selecting project 4 should only return project project 1

My query so far is pretty dependant of left joins and for sure there is a better way to do this:

SELECT all_tags.project_id, all_tags.tag_id, final.title, tag.tag
FROM projects AS p
LEFT JOIN projects_to_tags AS pt ON p.num = pt.project_id
LEFT JOIN projects_to_tags AS all_tags ON pt.tag_id = all_tags.tag_id
LEFT JOIN projects AS final ON all_tags.project_id = final.num
LEFT JOIN tags AS tag ON all_tags.tag_id = tag.tag_id
WHERE p.num = 4
GROUP BY final.num

Thank you all for the input. I though I'd share with you guys the average results of all the queries on a 100k projects database, 100k tags database with a 100k projects_to_tags relation. All queries were changed to ask for project_1.

The sweet and short:

0.0160 sec - OMG Ponies - Using JOINS  
0.0208 sec - jdelard  
0.2581 sec - OMG Ponies - Using EXISTS  
0.2777 sec - OMG Ponies - Using IN  
0.5295 sec - Emtucifor - updated query  
0.5088 sec - Emtucifor - first query  

Thank you all very much for this. Gonna update ALL my queries accordingly.

Here go all queries and respective MySQL EXPLAIN's along with time's

===============================================================================================================================================
Emtucifor - updated query
===============================================================================================================================================
Showing rows 0 - 1 (2 total, Query took 0.5295 sec)
SELECT * 
FROM projects AS L
WHERE L.num !=1-- instead of <> PT2.project_id inside

AND EXISTS (

SELECT 1 
FROM projects_to_tags PT
INNER JOIN projects_to_tags PT2 ON PT.tag_id = PT2.tag_id
WHERE L.num = PT.project_id
AND PT2.project_id =1
)
LIMIT 0 , 30

id  select_type table   type    possible_keys   key key_len ref rows    Extra
1   PRIMARY L   ALL PRIMARY NULL    NULL    NULL    100000  Using where
2   DEPENDENT SUBQUERY  PT2 ref project_id  project_id  4   const   1   Using index
2   DEPENDENT SUBQUERY  PT  ref project_id  project_id  8   test.L.num,test.PT2.tag_id  12000   Using index




===============================================================================================================================================
Emtucifor - first query
===============================================================================================================================================
Showing rows 0 - 1 (2 total, Query took 0.5088 sec)
SELECT * 
FROM projects AS L
WHERE 
EXISTS (

SELECT 1 
FROM projects_to_tags PT
INNER JOIN projects_to_tags PT2 ON PT.tag_id = PT2.tag_id
WHERE L.num = PT.project_id
AND PT2.project_id =1
AND PT2.project_id <> L.num
)
LIMIT 0 , 30

id  select_type table   type    possible_keys   key key_len ref rows    Extra
1   PRIMARY L   ALL NULL    NULL    NULL    NULL    100000  Using where
2   DEPENDENT SUBQUERY  PT2 ref project_id  project_id  4   const   1   Using index
2   DEPENDENT SUBQUERY  PT  ref project_id  project_id  8   test.L.num,test.PT2.tag_id  12000   Using where; Using index




===============================================================================================================================================
jdelard
===============================================================================================================================================
Showing rows 0 - 1 (2 total, Query took 0.0208 sec)
SELECT p.num, p.title
FROM projects_to_tags pt1, projects_to_tags pt2, projects p
WHERE pt1.project_id =1
AND pt2.project_id !=1
AND pt1.tag_id = pt2.tag_id
AND p.num = pt2.project_id
GROUP BY pt2.project_id
LIMIT 0 , 30

id  select_type table   type    possible_keys   key key_len ref rows    Extra
1   SIMPLE  pt1 ref project_id  project_id  4   const   1   Using index; Using temporary; Using filesort
1   SIMPLE  pt2 index   project_id  project_id  8   NULL    75001   Using where; Using index
1   SIMPLE  p   eq_ref  PRIMARY PRIMARY 4   test.pt2.project_id 1    




===============================================================================================================================================
OMG Ponies - Using IN
===============================================================================================================================================
Showing rows 0 - 2 (3 total, Query took 0.2777 sec)
SELECT p . * 
FROM projects p
JOIN projects_to_tags pt ON pt.project_id = p.num
WHERE pt.tag_id
IN (

SELECT x.tag_id
FROM projects_to_tags x
WHERE x.project_id =1
)
LIMIT 0 , 30

id  select_type table   type    possible_keys   key key_len ref rows    Extra
1   PRIMARY pt  index   project_id  project_id  8   NULL    100001  Using where; Using index
1   PRIMARY p   eq_ref  PRIMARY PRIMARY 4   test.pt.project_id  1    
2   DEPENDENT SUBQUERY  x   ref project_id  project_id  8   const,func  12000   Using where; Using index




===============================================================================================================================================
OMG Ponies - Using EXISTS
===============================================================================================================================================
Showing rows 0 - 2 (3 total, Query took 0.2581 sec)
SELECT p . * 
FROM projects p
JOIN projects_to_tags pt ON pt.project_id = p.num
WHERE EXISTS (

SELECT NULL 
FROM projects_to_tags x
WHERE x.project_id = 1
AND x.tag_id = pt.tag_id
)
LIMIT 0 , 30




===============================================================================================================================================
OMG Ponies - Using JOINS
===============================================================================================================================================
Showing rows 0 - 2 (3 total, Query took 0.0160 sec)
SELECT DISTINCT p . * 
FROM projects p
JOIN projects_to_tags pt ON pt.project_id = p.num
JOIN projects_to_tags x ON x.tag_id = pt.tag_id
AND x.project_id = 1
LIMIT 0 , 30

id  select_type table   type    possible_keys   key key_len ref rows    Extra
1   SIMPLE  x   ref project_id  project_id  4   const   1   Using index; Using temporary
1   SIMPLE  pt  index   project_id  project_id  8   NULL    75001   Using where; Using index
1   SIMPLE  p   eq_ref  PRIMARY PRIMARY 4   test.pt.project_id  1   

SQL code to copy/paste and mess around.

CREATE TABLE IF NOT EXISTS `projects` (
  `num` int(2) NOT NULL auto_increment,
  `title` varchar(30) NOT NULL,
  PRIMARY KEY  (`num`)
) ENGINE=MyISAM  DEFAULT CHARSET=latin1 AUTO_INCREMENT=5 ;


INSERT INTO `projects` (`num`, `title`) VALUES(1, 'project 1'),(2, 'project 2'),(3, 'project 3'),(4, 'project 4');


CREATE TABLE IF NOT EXISTS `projects_to_tags` (
  `project_id` int(2) NOT NULL,
  `tag_id` int(2) NOT NULL,
  KEY `project_id` (`project_id`,`tag_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;


INSERT INTO `projects_to_tags` (`project_id`, `tag_id`) VALUES(1, 1),(1, 2),(1, 3),(2, 1),(2, 2),(3, 1),(4, 3);


CREATE TABLE IF NOT EXISTS `tags` (
  `tag_id` int(2) NOT NULL auto_increment,
  `tag` varchar(30) NOT NULL,
  PRIMARY KEY  (`tag_id`),
  UNIQUE KEY `tag` (`tag`)
) ENGINE=MyISAM  DEFAULT CHARSET=latin1 AUTO_INCREMENT=4 ;


INSERT INTO `tags` (`tag_id`, `tag`) VALUES(1, 'tag a'),(2, 'tag b'),(3, 'tag c');
+3  A: 

In any of the following cases, if you don't know the PROJECT.num/PROJECT_TO_TAGS.project_id, you'll have to join to the PROJECTS table to get the id value for finding out what tags it has associated.

Using IN

SELECT p.*
  FROM PROJECTS p
  JOIN PROJECTS_TO_TAGS pt ON pt.project_id = p.num
 WHERE pt.tag_id IN (SELECT x.tag_id
                       FROM PROJECTS_TO_TAGS x
                      WHERE x.project_id = 4)

Using EXISTS

SELECT p.*
  FROM PROJECTS p
  JOIN PROJECTS_TO_TAGS pt ON pt.project_id = p.num
 WHERE EXISTS (SELECT NULL
                 FROM PROJECTS_TO_TAGS x
                WHERE x.project_id = 4
                  AND x.tag_id = pt.tag_id)

Using JOINS (this the most efficient one!)

The DISTINCT is necessary because JOINs risk duplicated data turning up in the resultset...

SELECT DISTINCT p.*
  FROM PROJECTS p
  JOIN PROJECTS_TO_TAGS pt ON pt.project_id = p.num
  JOIN PROJECTS_TO_TAGS x ON x.tag_id = pt.tag_id
                         AND x.project_id = 4
OMG Ponies
@OMG Ponies thank you very much! Your SQL looks beautiful! Nevertheless all 3 possibilities require a full table scan. I was pretty much trying to avoid that (which I do on my spaghetti-looking SQL code)
Frankie
+2  A: 

Something like this... ?

SELECT *
FROM projects AS L
WHERE
   EXISTS (
      SELECT 1
      FROM
         projects_to_tags PT
         INNER JOIN projects_to_tags PT2 ON PT.tag_id = PT2.tag_id
      WHERE
         L.num = PT.project_id
         AND PT2.project_id = 4
         AND PT2.project_id <> L.num
   )

That's 2 seeks and a scan.

UPDATE

Taking a page from jdelard's book, one tiny modification switches my query to outperform his (of course I'm doing this on SQL Server meaning I took out his GROUP BY and put in a DISTINCT, so YMMV on MySQL):

SELECT *
FROM projects AS L
WHERE
   L.num != 4 -- instead of <> PT2.project_id inside
   AND EXISTS (
      SELECT 1
      FROM
         projects_to_tags PT
         INNER JOIN projects_to_tags PT2 ON PT.tag_id = PT2.tag_id
      WHERE
         L.num = PT.project_id
         AND PT2.project_id = 4
   )

The improvement over his query comes from not doing a DISTINCT or aggregate, and using a semi join instead of a complete join so not every row has to be joined. Otherwise, semantically they are largely the same.

I will have to remember jdelard's trick as it is a very useful tool. For some reason the query engine was not smart enough to compute that given {a = 4, a != b} then {b != 4}.

Emtucifor
@Emtucifor very nice, thanks! The full table scan put me a bit off and the projects table can (and will) grow to a pretty extensive size. I specially like non-showing the original project that, even though was not asked, was the intention. Thanks!
Frankie
+3  A: 

How about... (example for project 1)

SELECT p.num, p.title
FROM projects_to_tags pt1, projects_to_tags pt2, projects p
where pt1.project_id = 1 and 
      pt2.project_id != 1 and 
      pt1.tag_id = pt2.tag_id and 
      p.num = pt2.project_id 
group by pt2.project_id

And maybe add a separate index for tag_id in projects_to_tags so you can use it alone, instead of the composite. No more type ALL. (Table Scan) Replacing both 1 with 4 give also the desired results.

jdelard
@jdelard just perfect. Simple, elegant and very performance wise solution. Love it. Thanks!
Frankie
Please explain how MySql does a group by with columns in the select list that have no aggregate on them?
Emtucifor
Also, I didn't know anyone was still writing non-ANSI joins... in 2010!?!?! :)
Emtucifor
@Emtucifor: This is ANSI-89 join syntax, but I agree that theses days it should be ANSI-92 syntax. Also, "hidden" columns in the `GROUP BY` is documented behavior: http://dev.mysql.com/doc/refman/5.0/en/group-by-hidden-columns.html
OMG Ponies
@OMG Ponies - interesting. It's like a default First() or Any() aggregate.
Emtucifor