views:

154

answers:

7

I have the following data structure and data:

CREATE TABLE `parent` (
  `id` int(11) NOT NULL auto_increment,
  `name` varchar(10) NOT NULL,
  PRIMARY KEY  (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

INSERT INTO `parent` VALUES(1, 'parent 1');
INSERT INTO `parent` VALUES(2, 'parent 2');

CREATE TABLE `other` (
  `id` int(11) NOT NULL auto_increment,
  `name` varchar(10) NOT NULL,
  PRIMARY KEY  (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

INSERT INTO `other` VALUES(1, 'other 1');
INSERT INTO `other` VALUES(2, 'other 2');

CREATE TABLE `relationship` (
  `id` int(11) NOT NULL auto_increment,
  `parent_id` int(11) NOT NULL,
  `other_id` int(11) NOT NULL,
  PRIMARY KEY  (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

INSERT INTO `relationship` VALUES(1, 1, 1);
INSERT INTO `relationship` VALUES(2, 1, 2);
INSERT INTO `relationship` VALUES(3, 2, 1);

I want to find the the parent records with both other's 1 & 2.

This is what I've figured out, but I'm wondering if there is a better way:

SELECT p.id, p.name
FROM parent AS p
    LEFT JOIN relationship AS r1 ON (r1.parent_id = p.id)
    LEFT JOIN relationship AS r2 ON (r2.parent_id = p.id)
WHERE r1.other_id = 1 AND r2.other_id = 2;

The result is 1, "parent 1" which is correct. The problem is that once you get a list of 5+ joins, it gets messy and as the relationship table grows, it gets slow.

Is there a better way?

I'm using MySQL and PHP, but this is probably pretty generic.

+4  A: 

Ok, I tested this. The queries from best to worst were:

Query 1: Joins (0.016s; basically instant)

SELECT p.id, name
FROM parent p
JOIN relationship r1 ON p.id = r1.parent_id AND r1.other_id = 100
JOIN relationship r2 ON p.id = r2.parent_id AND r2.other_id = 101
JOIN relationship r3 ON p.id = r3.parent_id AND r3.other_id = 102
JOIN relationship r4 ON p.id = r4.parent_id AND r4.other_id = 103

Query 2: EXISTS (0.625s)

SELECT id, name
FROM parent p
WHERE EXISTS (SELECT 1 FROM relationship WHERE parent_id = p.id AND other_id = 100)
AND EXISTS (SELECT 1 FROM relationship WHERE parent_id = p.id AND other_id = 101)
AND EXISTS (SELECT 1 FROM relationship WHERE parent_id = p.id AND other_id = 102)
AND EXISTS (SELECT 1 FROM relationship WHERE parent_id = p.id AND oth

Query 3: Aggregate (1.016s)

SELECT p.id, p.name FROM parent p WHERE (SELECT COUNT(*) FROM relationship WHERE parent_id = p.id AND other_id IN (100,101,102,103))

Query 4: UNION Aggregate (2.39s)

SELECT id, name FROM (
  SELECT p1.id, p1.name
  FROM parent AS p1 LEFT JOIN relationship as r1 ON(r1.parent_id=p1.id)
  WHERE r1.other_id = 100
  UNION ALL
  SELECT p2.id, p2.name
  FROM parent AS p2 LEFT JOIN relationship as r2 ON(r2.parent_id=p2.id)
  WHERE r2.other_id = 101
  UNION ALL
  SELECT p3.id, p3.name
  FROM parent AS p3 LEFT JOIN relationship as r3 ON(r3.parent_id=p3.id)
  WHERE r3.other_id = 102
  UNION ALL
  SELECT p4.id, p4.name
  FROM parent AS p4 LEFT JOIN relationship as r4 ON(r4.parent_id=p4.id)
  WHERE r4.other_id = 103
) a
GROUP BY id, name
HAVING count(*) = 4

Actually the above was producing the wrong data so it's either wrong or I did something wrong with it. Whatever the case, the above is just a bad idea.

If that's not fast then you need to look at the explain plan for the query. You're probably just lacking appropriate indices. Try it with:

CREATE INDEX ON relationship (parent_id, other_id)

Before you go down the route of aggregation (SELECT COUNT(*) FROM ...) you should read SQL Statement - “Join” Vs “Group By and Having”.

Note: The above timings are based on:

CREATE TABLE parent (
  id INT PRIMARY KEY,
  name VARCHAR(50)
);

CREATE TABLE other (
  id INT PRIMARY KEY,
  name VARCHAR(50)
);

CREATE TABLE relationship (
  id INT PRIMARY KEY,
  parent_id INT,
  other_id INT
);

CREATE INDEX idx1 ON relationship (parent_id, other_id);
CREATE INDEX idx2 ON relationship (other_id, parent_id);

and nearly 800,000 records created with:

<?php
ini_set('max_execution_time', 600);

$start = microtime(true);

echo "<pre>\n";
mysql_connect('localhost', 'scratch', 'scratch');
if (mysql_error()) {
    echo "Connect error: " . mysql_error() . "\n";
}
mysql_select_db('scratch');
if (mysql_error()) {
    echo "Selct DB error: " . mysql_error() . "\n";
}

define('PARENTS', 100000);
define('CHILDREN', 100000);
define('MAX_CHILDREN', 10);
define('SCATTER', 10);
$rel = 0;
for ($i=1; $i<=PARENTS; $i++) {
    query("INSERT INTO parent VALUES ($i, 'Parent $i')");
    $potential = range(max(1, $i - SCATTER), min(CHILDREN, $i + SCATTER));
    $elements = sizeof($potential);
    $other = rand(1, min(MAX_CHILDREN, $elements - 4));
    $j = 0;
    while ($j < $other) {
        $index = rand(0, $elements - 1);
        if (isset($potential[$index])) {
            $c = $potential[$index];
            $rel++;
            query("INSERT INTO relationship VALUES ($rel, $i, $c)");
            unset($potential[$index]);
            $j++;
        }
    }
}
for ($i=1; $i<=CHILDREN; $i++) {
    query("INSERT INTO other VALUES ($i, 'Other $i')");
}

$count = PARENTS + CHILDREN + $rel;
$stop = microtime(true);
$duration = $stop - $start;
$insert = $duration / $count;

echo "$count records added.\n";
echo "Program ran for $duration seconds.\n";
echo "Insert time $insert seconds.\n";
echo "</pre>\n";

function query($str) {
    mysql_query($str);
    if (mysql_error()) {
        echo "$str: " . mysql_error() . "\n";
    }
}
?>

So once again joins carry the day.

cletus
This is exactly what I have, just written differently.
Darryl Hein
You completely changed your answer...
Darryl Hein
Yes. Because I misunderstood the question.
cletus
Hmmm ... looks like my first guess was right ... and you didn't even need DISTINCT. It's nice when the best answer can be determnined by testing, rather than votes. Thanks, cletus.
le dorfier
it proves that sometimes the not generic query is better than the generic, of course not always you can dinamically create the query.
pablito
A: 

I haven't actually tested it, but something along the lines of:

SELECT id, name FROM (
  SELECT p1.id, p1.name
  FROM parent AS p1 LEFT JOIN relationship as r1 ON(r1.parent_id=p1.id)
  WHERE r1.other_id = 1
  UNION ALL
  SELECT p2.id, p2.name
  FROM parent AS p2 LEFT JOIN relationship as r2 ON(r2.parent_id=p2.id)
  WHERE r2.other_id = 2
   -- etc
) GROUP BY id, name
HAVING count(*) = 2

The idea is you don't have to do multi-way joins; just concatenate the results of regular joins, group by your ids, and pick the rows that showed up in every segment.

SquareCog
Hmmm, that could work. I think it's even more messy than what I have, but maybe more obvious.
Darryl Hein
UNION in subquery = REALLY REALLY bad. Don't do this.
cletus
it's not pretty, but the code to generate it is straightforward, and I suspect you'll see a big performance improvement when you have many parents. If you try it, comment with your results -- I am curious.
SquareCog
Cletus -- really? I know Oracle doesn't care, but I do have some traumatic experience with subqueries in MySQL.. what does it do that makes it bad?
SquareCog
Oracle doesn't? Er I've worked on systems absolutely crippled by UNION and UNION ALL usage...
cletus
Oracle doesn't seem to care whether a query is a subquery or not. I thought you were saying don't use unions in subqueries in particular -- were you saying don't use unions, in general?
SquareCog
+2  A: 

Given that parent table contains unique key on (parent_id, other_id) you can do this:

select p.id, p.name 
  from parent as p 
 where (select count(*) 
        from relationship as r 
       where r.parent_id = p.id 
         and r.other_id in (1,2)
        ) >= 2
grigory
Very good idea... now to attempt to integrate with the rest of the SQL statement...hmmm
Darryl Hein
Warning: before going down that route read http://stackoverflow.com/questions/477006/sql-statement-join-vs-group-by-and-having/477013#477013
cletus
This solution (with aggregation) is more efficient than multiple selects if you expect to have varying number of children. Adding ids to the IN list is more readable and possibly more efficient than adding conditions with SELECT per id. In any case check query plan to compare queries...
grigory
A: 

This is a common problem when searching multiple associates via a many to many join. This is often encountered in services using the 'tag' concept e.g. Stackoverflow

See my other post on a better architecture for tag (in your case 'other') storage

Searching is a two step process:

  1. Find all possible candiates of TagCollections that have any/all the tags you require (may be easier using a cursor of loop construct)
  2. Select data based that matches TagCollection

Performance is always faster due to there being significantly less TagCollections than data items to search

TFD
A: 

You can do it with a nested select , I tested it in MSSQL 2005 but as you said it should be pretty generic

SELECT * FROM parent p
WHERE p.id in(
    SELECT r.parent_Id 
    FROM relationship r 
    WHERE r.parent_id in(1,2) 
    GROUP BY r.parent_id
    HAVING COUNT(r.parent_Id)=2
)

and the number 2 in COUNT(r.parent_Id)=2 is according to the number of joins you need)

pablito
why did I get a downvote for a tested working query? (at least you can explain what is wrong with it)
pablito
Wasn't me, but maybe because it's almost the exact same as: http://stackoverflow.com/questions/599461/how-do-you-perform-an-and-with-a-join/599485#599485
Darryl Hein
well, while i was building it in MSSQL Studio I was not aware that a similar answer was posted, still this is not a reason for a downvote.
pablito
I'll give you an upvote to make it 0...although it still gives you more rep.
Darryl Hein
Thanks, just to make clear the rep is not the issue, I am here to learn if I had a wrong answer I would just like to know what is wrong.
pablito
Shouldn't the r.parent_id in (1, 2) be r.other_id in (1, 2)?
Tom H.
+1  A: 

Simplifying a bit, this should work, and efficiently.

SELECT DISTINCT p.id, p.name
FROM parent p
INNER JOIN relationship r1 ON p.id = r1.parent_id AND r1.other_id = 1
INNER JOIN relationship r2 ON p.id = r2.parent_id AND r2.other_id = 2

will require at least one joined record for each "other" value. And the optimizer should know it only has to find one match each, and it only needs to read the index, not either of the subsidiary tables, one of which isn't even referenced at all.

le dorfier
A: 

If you can put your list of other_id values into a table that would be ideal. The code below looks for parents with AT LEAST the ids given. If you want it to have EXACTLY the same ids (i.e. no extras) you would have to change the query slightly.

SELECT
     p.id,
     p.name
FROM
     My_Other_IDs MOI
INNER JOIN Relationships R ON
     R.other_id = MOI.other_id
INNER JOIN Parents P ON
     P.parent_id = R.parent_id
GROUP BY
     p.parent_id,
     p.name
HAVING
     COUNT(*) = (SELECT COUNT(*) FROM My_Other_IDs)
Tom H.