ansaurus

Question

Answer 1

+4 A:

Ok, I tested this. The queries from best to worst were:

Query 1: Joins (0.016s; basically instant)

SELECT p.id, name
FROM parent p
JOIN relationship r1 ON p.id = r1.parent_id AND r1.other_id = 100
JOIN relationship r2 ON p.id = r2.parent_id AND r2.other_id = 101
JOIN relationship r3 ON p.id = r3.parent_id AND r3.other_id = 102
JOIN relationship r4 ON p.id = r4.parent_id AND r4.other_id = 103

Query 2: EXISTS (0.625s)

SELECT id, name
FROM parent p
WHERE EXISTS (SELECT 1 FROM relationship WHERE parent_id = p.id AND other_id = 100)
AND EXISTS (SELECT 1 FROM relationship WHERE parent_id = p.id AND other_id = 101)
AND EXISTS (SELECT 1 FROM relationship WHERE parent_id = p.id AND other_id = 102)
AND EXISTS (SELECT 1 FROM relationship WHERE parent_id = p.id AND oth

Query 3: Aggregate (1.016s)

SELECT p.id, p.name FROM parent p WHERE (SELECT COUNT(*) FROM relationship WHERE parent_id = p.id AND other_id IN (100,101,102,103))

Query 4: UNION Aggregate (2.39s)

SELECT id, name FROM (
  SELECT p1.id, p1.name
  FROM parent AS p1 LEFT JOIN relationship as r1 ON(r1.parent_id=p1.id)
  WHERE r1.other_id = 100
  UNION ALL
  SELECT p2.id, p2.name
  FROM parent AS p2 LEFT JOIN relationship as r2 ON(r2.parent_id=p2.id)
  WHERE r2.other_id = 101
  UNION ALL
  SELECT p3.id, p3.name
  FROM parent AS p3 LEFT JOIN relationship as r3 ON(r3.parent_id=p3.id)
  WHERE r3.other_id = 102
  UNION ALL
  SELECT p4.id, p4.name
  FROM parent AS p4 LEFT JOIN relationship as r4 ON(r4.parent_id=p4.id)
  WHERE r4.other_id = 103
) a
GROUP BY id, name
HAVING count(*) = 4

Actually the above was producing the wrong data so it's either wrong or I did something wrong with it. Whatever the case, the above is just a bad idea.

If that's not fast then you need to look at the explain plan for the query. You're probably just lacking appropriate indices. Try it with:

CREATE INDEX ON relationship (parent_id, other_id)

Before you go down the route of aggregation (SELECT COUNT(*) FROM ...) you should read SQL Statement - “Join” Vs “Group By and Having”.

Note: The above timings are based on:

CREATE TABLE parent (
  id INT PRIMARY KEY,
  name VARCHAR(50)
);

CREATE TABLE other (
  id INT PRIMARY KEY,
  name VARCHAR(50)
);

CREATE TABLE relationship (
  id INT PRIMARY KEY,
  parent_id INT,
  other_id INT
);

CREATE INDEX idx1 ON relationship (parent_id, other_id);
CREATE INDEX idx2 ON relationship (other_id, parent_id);

and nearly 800,000 records created with:

<?php
ini_set('max_execution_time', 600);

$start = microtime(true);

echo "<pre>\n";
mysql_connect('localhost', 'scratch', 'scratch');
if (mysql_error()) {
    echo "Connect error: " . mysql_error() . "\n";
}
mysql_select_db('scratch');
if (mysql_error()) {
    echo "Selct DB error: " . mysql_error() . "\n";
}

define('PARENTS', 100000);
define('CHILDREN', 100000);
define('MAX_CHILDREN', 10);
define('SCATTER', 10);
$rel = 0;
for ($i=1; $i<=PARENTS; $i++) {
    query("INSERT INTO parent VALUES ($i, 'Parent $i')");
    $potential = range(max(1, $i - SCATTER), min(CHILDREN, $i + SCATTER));
    $elements = sizeof($potential);
    $other = rand(1, min(MAX_CHILDREN, $elements - 4));
    $j = 0;
    while ($j < $other) {
        $index = rand(0, $elements - 1);
        if (isset($potential[$index])) {
            $c = $potential[$index];
            $rel++;
            query("INSERT INTO relationship VALUES ($rel, $i, $c)");
            unset($potential[$index]);
            $j++;
        }
    }
}
for ($i=1; $i<=CHILDREN; $i++) {
    query("INSERT INTO other VALUES ($i, 'Other $i')");
}

$count = PARENTS + CHILDREN + $rel;
$stop = microtime(true);
$duration = $stop - $start;
$insert = $duration / $count;

echo "$count records added.\n";
echo "Program ran for $duration seconds.\n";
echo "Insert time $insert seconds.\n";
echo "</pre>\n";

function query($str) {
    mysql_query($str);
    if (mysql_error()) {
        echo "$str: " . mysql_error() . "\n";
    }
}
?>

So once again joins carry the day.

cletus 2009-03-01 08:08:27

This is exactly what I have, just written differently.

Darryl Hein 2009-03-01 08:10:07

You completely changed your answer...

Darryl Hein 2009-03-01 08:32:07

Yes. Because I misunderstood the question.

cletus 2009-03-01 08:36:26

Hmmm ... looks like my first guess was right ... and you didn't even need DISTINCT. It's nice when the best answer can be determnined by testing, rather than votes. Thanks, cletus.

le dorfier 2009-03-01 17:48:01

it proves that sometimes the not generic query is better than the generic, of course not always you can dinamically create the query.

pablito 2009-03-02 09:12:39

Answer 2

A:

I haven't actually tested it, but something along the lines of:

SELECT id, name FROM (
  SELECT p1.id, p1.name
  FROM parent AS p1 LEFT JOIN relationship as r1 ON(r1.parent_id=p1.id)
  WHERE r1.other_id = 1
  UNION ALL
  SELECT p2.id, p2.name
  FROM parent AS p2 LEFT JOIN relationship as r2 ON(r2.parent_id=p2.id)
  WHERE r2.other_id = 2
   -- etc
) GROUP BY id, name
HAVING count(*) = 2

The idea is you don't have to do multi-way joins; just concatenate the results of regular joins, group by your ids, and pick the rows that showed up in every segment.

SquareCog 2009-03-01 08:12:52

Hmmm, that could work. I think it's even more messy than what I have, but maybe more obvious.

Darryl Hein 2009-03-01 08:16:03

UNION in subquery = REALLY REALLY bad. Don't do this.

cletus 2009-03-01 08:18:49

it's not pretty, but the code to generate it is straightforward, and I suspect you'll see a big performance improvement when you have many parents. If you try it, comment with your results -- I am curious.

SquareCog 2009-03-01 08:19:03

Cletus -- really? I know Oracle doesn't care, but I do have some traumatic experience with subqueries in MySQL.. what does it do that makes it bad?

SquareCog 2009-03-01 08:20:23

Oracle doesn't? Er I've worked on systems absolutely crippled by UNION and UNION ALL usage...

cletus 2009-03-01 08:28:23

Oracle doesn't seem to care whether a query is a subquery or not. I thought you were saying don't use unions in subqueries in particular -- were you saying don't use unions, in general?

SquareCog 2009-03-01 08:30:55

Answer 3

+2 A:

Given that parent table contains unique key on (parent_id, other_id) you can do this:

select p.id, p.name 
  from parent as p 
 where (select count(*) 
        from relationship as r 
       where r.parent_id = p.id 
         and r.other_id in (1,2)
        ) >= 2

grigory 2009-03-01 08:16:02

Very good idea... now to attempt to integrate with the rest of the SQL statement...hmmm

Darryl Hein 2009-03-01 08:19:17

Warning: before going down that route read http://stackoverflow.com/questions/477006/sql-statement-join-vs-group-by-and-having/477013#477013

cletus 2009-03-01 08:23:03

This solution (with aggregation) is more efficient than multiple selects if you expect to have varying number of children. Adding ids to the IN list is more readable and possibly more efficient than adding conditions with SELECT per id. In any case check query plan to compare queries...

grigory 2009-03-01 08:51:24

Answer 4

A:

This is a common problem when searching multiple associates via a many to many join. This is often encountered in services using the 'tag' concept e.g. Stackoverflow

See my other post on a better architecture for tag (in your case 'other') storage

Searching is a two step process:

Find all possible candiates of TagCollections that have any/all the tags you require (may be easier using a cursor of loop construct)
Select data based that matches TagCollection

Performance is always faster due to there being significantly less TagCollections than data items to search

TFD 2009-03-01 08:17:39

Answer 5

A:

You can do it with a nested select , I tested it in MSSQL 2005 but as you said it should be pretty generic

SELECT * FROM parent p
WHERE p.id in(
    SELECT r.parent_Id 
    FROM relationship r 
    WHERE r.parent_id in(1,2) 
    GROUP BY r.parent_id
    HAVING COUNT(r.parent_Id)=2
)

and the number 2 in COUNT(r.parent_Id)=2 is according to the number of joins you need)

pablito 2009-03-01 08:27:39

why did I get a downvote for a tested working query? (at least you can explain what is wrong with it)

pablito 2009-03-01 08:43:34

Wasn't me, but maybe because it's almost the exact same as: http://stackoverflow.com/questions/599461/how-do-you-perform-an-and-with-a-join/599485#599485

Darryl Hein 2009-03-01 08:45:46

well, while i was building it in MSSQL Studio I was not aware that a similar answer was posted, still this is not a reason for a downvote.

pablito 2009-03-01 08:50:38

I'll give you an upvote to make it 0...although it still gives you more rep.

Darryl Hein 2009-03-01 08:53:36

Thanks, just to make clear the rep is not the issue, I am here to learn if I had a wrong answer I would just like to know what is wrong.

pablito 2009-03-01 08:58:36

Shouldn't the r.parent_id in (1, 2) be r.other_id in (1, 2)?

Tom H. 2009-03-01 16:03:54

Answer 6

+1 A:

Simplifying a bit, this should work, and efficiently.

SELECT DISTINCT p.id, p.name
FROM parent p
INNER JOIN relationship r1 ON p.id = r1.parent_id AND r1.other_id = 1
INNER JOIN relationship r2 ON p.id = r2.parent_id AND r2.other_id = 2

will require at least one joined record for each "other" value. And the optimizer should know it only has to find one match each, and it only needs to read the index, not either of the subsidiary tables, one of which isn't even referenced at all.

le dorfier 2009-03-01 08:28:21

Answer 7

A:

If you can put your list of other_id values into a table that would be ideal. The code below looks for parents with AT LEAST the ids given. If you want it to have EXACTLY the same ids (i.e. no extras) you would have to change the query slightly.

SELECT
     p.id,
     p.name
FROM
     My_Other_IDs MOI
INNER JOIN Relationships R ON
     R.other_id = MOI.other_id
INNER JOIN Parents P ON
     P.parent_id = R.parent_id
GROUP BY
     p.parent_id,
     p.name
HAVING
     COUNT(*) = (SELECT COUNT(*) FROM My_Other_IDs)

Tom H. 2009-03-01 16:18:43

ansaurus

tags:

views:

answers:

How do you perform an AND with a join?

related questions