I can't seem to find a relevant example out there.

I'm trying to return a subset of a table, and for each row in that subset I want to check how many children it has and return that number as part of the result set.

Parent Table Columns: PK_ID, Column1, Column2, FK1

For each FK1 in result set, select count(*) from child_table.

Final result set

3, col1text, col2text, 1(child)
5, col1texta, col2texta, 2(child)
6, col1textb, col2textb, 0(child)
9, col1textc, col2textc, 4(child)

I'm struggling with the best way to reference a column from the result set in another query and then join them back together. I'm using T-SQL.

+5  A: 

Ok, apparently, based on the upvotes for the other answer, this needs further explanation. Example (done with MySQL because I have it handy but the principle is universal to any SQL dialect):

CREATE TABLE Blah (
  ID INT PRIMARY KEY,
  SomeText VARCHAR(30),
  ParentID INT
)

INSERT INTO Blah VALUES (1, 'One', 0);
INSERT INTO Blah VALUES (2, 'Two', 0);
INSERT INTO Blah VALUES (3, 'Three', 1);
INSERT INTO Blah VALUES (4, 'Four', 1);
INSERT INTO Blah VALUES (5, 'Five', 4);

Inner join version:

SELECT a.ID, a.SomeText, COUNT(1)
FROM Blah a
JOIN Blah b ON a.ID = b.ParentID
GROUP BY a.ID, a.SomeText

Wrong. Ignores the case with no children.

Left outer join:

SELECT a.ID, a.SomeText, COUNT(1)
FROM Blah a
LEFT OUTER JOIN Blah b ON a.ID = b.ParentID
GROUP BY a.ID, a.SomeText

Wrong, and the reason is somewhat subtle: COUNT(1) counts NULL rows, whereas COUNT(b.ID) doesn't. So the above is wrong, but this is correct:

SELECT a.ID, a.SomeText, COUNT(b.ID)
FROM Blah a
LEFT OUTER JOIN Blah b ON a.ID = b.ParentID
GROUP BY a.ID, a.SomeText

Correlated subquery:

SELECT ID, SomeText, (SELECT COUNT(1) FROM Blah WHERE ParentID = a.ID) ChildCount
FROM Blah a

Also correct.
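To make the difference concrete, here's a quick sketch that runs all four variants against the five sample rows above. It uses Python's stdlib sqlite3 purely as a convenient stand-in engine; the SQL is essentially unchanged:

```python
import sqlite3

# In-memory database with the same five sample rows as above.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Blah (ID INTEGER PRIMARY KEY, SomeText TEXT, ParentID INTEGER);
INSERT INTO Blah VALUES (1, 'One', 0), (2, 'Two', 0), (3, 'Three', 1),
                        (4, 'Four', 1), (5, 'Five', 4);
""")

# Inner join: rows with no children vanish entirely.
inner = con.execute("""
    SELECT a.ID, COUNT(1) FROM Blah a
    JOIN Blah b ON a.ID = b.ParentID
    GROUP BY a.ID ORDER BY a.ID""").fetchall()
print(inner)   # [(1, 2), (4, 1)] -- IDs 2, 3 and 5 are missing

# Left outer join with COUNT(1): zero-child rows wrongly report 1.
wrong = con.execute("""
    SELECT a.ID, COUNT(1) FROM Blah a
    LEFT OUTER JOIN Blah b ON a.ID = b.ParentID
    GROUP BY a.ID ORDER BY a.ID""").fetchall()
print(wrong)   # [(1, 2), (2, 1), (3, 1), (4, 1), (5, 1)]

# COUNT(b.ID) skips the NULLs from the outer join, so zero-child rows report 0.
right = con.execute("""
    SELECT a.ID, COUNT(b.ID) FROM Blah a
    LEFT OUTER JOIN Blah b ON a.ID = b.ParentID
    GROUP BY a.ID ORDER BY a.ID""").fetchall()
print(right)   # [(1, 2), (2, 0), (3, 0), (4, 1), (5, 0)]

# The correlated subquery gives the same answer.
sub = con.execute("""
    SELECT ID, (SELECT COUNT(1) FROM Blah WHERE ParentID = a.ID)
    FROM Blah a ORDER BY ID""").fetchall()
print(sub)     # [(1, 2), (2, 0), (3, 0), (4, 1), (5, 0)]
```

COUNT(b.ID) and the correlated subquery agree; the other two variants either lose the zero-child rows or inflate them.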

Ok, so which to use? Plans only tell you so much. The issue of subqueries vs left-joins is an old one and there's no clear answer without benchmarking it. So we need some data:

<?php
ini_set('max_execution_time', 180);

$start = microtime(true);

echo "<pre>\n";

mysql_connect('localhost', 'scratch', 'scratch');
if (mysql_error()) {
    echo mysql_error();
    exit();
}
mysql_select_db('scratch');
if (mysql_error()) {
    echo mysql_error();
    exit();
}

$count = 0;
$limit = 1000000;
$this_level = array(0);
$next_level = array();

while ($count < $limit) {
    foreach ($this_level as $parent) {
        $child_count = rand(0, 3);
        for ($i = 0; $i < $child_count; $i++) {
            $count++;
            query("INSERT INTO Blah (ID, SomeText, ParentID) VALUES ($count, 'Text $count', $parent)");
            $next_level[] = $count;
        }
    }
    $this_level = $next_level;
    $next_level = array();
}

$stop = microtime(true);
$duration = $stop - $start;
$inserttime = $duration / $count;

echo "$count users added.\n";
echo "Program ran for $duration seconds.\n";
echo "Insert time $inserttime seconds.\n";
echo "</pre>\n";

function query($query) {
    mysql_query($query);
    if (mysql_error()) {
        echo mysql_error();
        exit();
    }
}
?>

I ran out of memory (32M) during this run, so I only ended up with 876,109 records, but it will do. Later, when testing Oracle and SQL Server, I took the exact same set of data and imported it into Oracle XE and SQL Server Express 2005.

Now another poster raised the issue of my using a count wrapper around the queries. He correctly pointed out that the optimizer may not execute the subqueries in that case. MySQL doesn't seem to be that smart. Oracle is. SQL Server seems to be as well.

So I'll quote two figures for each database-query combination: the first is wrapped in SELECT COUNT(1) FROM ( ... ), the second is raw.

Setup:

  • MySQL 5.0 using PremiumSoft Navicat (LIMIT 10000 in query);
  • SQL Server Express 2005 using Microsoft SQL Server Management Studio Express;
  • Oracle XE using PL/SQL Developer 7 (limited to 10,000 rows).

Left outer join:

SELECT a.ID, a.SomeText, COUNT(b.ID)
FROM Blah a
LEFT OUTER JOIN Blah b ON a.ID= b.ParentID
GROUP BY a.ID, a.SomeText
  • MySQL 5.0: 51.469s / 49.907s
  • SQL Server: 0s (1) / 9s (2)
  • Oracle XE: 1.297s / 2.656s

(1) Virtually instantaneous (confirming the different execution path)
(2) Impressive considering it is returning all the rows, not 10,000

Just goes to show the value of a real database. Also, removing the SomeText field had a significant impact on MySQL's performance (improving it by a factor of 4-5), while there wasn't much difference between using the LIMIT of 10,000 and omitting it. Oracle had the limit only because PL/SQL Developer barfed when it hit 100M memory usage.

Correlated Subquery:

SELECT ID, SomeText, (SELECT COUNT(1) FROM Blah WHERE ParentID = a.ID) ChildCount
FROM Blah a
  • MySQL: 8.844s / 11.10s
  • SQL Server: 0s / 6s
  • Oracle: 0.046s / 1.563s

So MySQL is better by a factor of 4-5, Oracle is about twice as fast, and SQL Server is arguably only a little faster.

The point remains: the correlated subquery version is faster in all cases.

The other advantage of correlated subqueries is that they are syntactically cleaner and easier to extend. By this I mean that if you want to count rows in a bunch of other tables, each count can be included as another select item cleanly and easily. For example, imagine a record of customers and invoices, where each invoice is either unpaid, overdue or paid. With a subquery that is easy:

SELECT id,
  (SELECT COUNT(1) FROM invoices WHERE customer_id = c.id AND status = 'UNPAID') unpaid_invoices,
  (SELECT COUNT(1) FROM invoices WHERE customer_id = c.id AND status = 'OVERDUE') overdue_invoices,
  (SELECT COUNT(1) FROM invoices WHERE customer_id = c.id AND status = 'PAID') paid_invoices
FROM customers c

The aggregate version is a lot uglier.
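For comparison, here is roughly what that aggregate version might look like: one outer join plus a conditional count per status. This is a sketch against a hypothetical customers/invoices schema, run through Python's stdlib sqlite3:

```python
import sqlite3

# Hypothetical minimal schema: two customers, four invoices.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY);
CREATE TABLE invoices (id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT);
INSERT INTO customers VALUES (1), (2);
INSERT INTO invoices VALUES (1, 1, 'UNPAID'), (2, 1, 'PAID'),
                            (3, 1, 'PAID'), (4, 2, 'OVERDUE');
""")

# One LEFT JOIN, then a CASE inside each COUNT; CASE without ELSE yields
# NULL for non-matching rows, and COUNT ignores NULLs.
rows = con.execute("""
    SELECT c.id,
           COUNT(CASE WHEN i.status = 'UNPAID'  THEN 1 END) unpaid_invoices,
           COUNT(CASE WHEN i.status = 'OVERDUE' THEN 1 END) overdue_invoices,
           COUNT(CASE WHEN i.status = 'PAID'    THEN 1 END) paid_invoices
    FROM customers c
    LEFT JOIN invoices i ON i.customer_id = c.id
    GROUP BY c.id
    ORDER BY c.id""").fetchall()
print(rows)  # [(1, 1, 0, 2), (2, 0, 1, 0)]
```

It produces the same numbers, but every new status means another CASE expression tangled into the same join, rather than one more independent select item.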

Now I'm not saying that subqueries are always superior to aggregate joins but often enough they are that you have to test it. Depending on your data, the size of that data and your RDBMS vendor the difference can be hugely significant.

cletus
count(b.id) will give the correct result with a left outer join. It will also force one sequential scan, and one index scan. Correlated subqueries force a nested loop, which means orders of magnitude worse performance.
SquareCog
@cletus: use COUNT(b.parentID) instead of COUNT(1). Where b is NULL because of the LEFT OUTER JOIN, it won't contribute to the count. Also, LEFT JOINs are always outer joins.
Bill Karwin
Your original answer was so much better (and correct).
David B
Benchmarks added and error corrected.
cletus
+1 for the *insane* follow-up research!
Sean Bright
Please see my answer for an explanation of why your measurements are fundamentally incorrect.
SquareCog
Cletus, these are very impressive numbers, which still don't make sense. Can you please show the explain plan? I still fail to understand how N^2 can be less than 2N.
SquareCog
+4  A: 

I believe this is what you are trying to do:

SELECT P.PK_ID, P.Column1, P.Column2, COUNT(C.PK_ID)
FROM
    Parent P
    LEFT JOIN Child C ON C.PK_ID = P.FK1
GROUP BY
    P.PK_ID, P.Column1, P.Column2
Sean Bright
-1 Doesn't handle zero children.
cletus
Ah, right. Good catch.
Sean Bright
Looks like it handles zero children to me. +1
David B
Try it. If you do a left join when there are no children there'll be no row and it won't say "0 children". It'll just not appear.
cletus
And a left outer join won't do it either, because you'll get a count of 1 for the NULL row unless you filter it out. The correct solution is the answer I posted.
cletus
In T-SQL, 'LEFT JOIN' implies 'LEFT OUTER JOIN.' Also, while COUNT(*) counts null values, COUNT(Column) does not. So the result is what is expected (0 when there is no matching child row).
Sean Bright
See my answer for further explanation.
cletus
Benchmarked (below). This query is at least 5 times slower than a correlated subquery.
cletus
Nice research! Did you remove your downvote?
Sean Bright
Have now. Thanks for the research comment although with -2 net votes, I'm really not sure why I bother.
cletus
+1  A: 

An explanation of why @cletus is wrong.

First, props on doing the research.

Second, you are doing it wrong.

Explanation:

Original query:

EXPLAIN
SELECT ID, (SELECT COUNT(1) FROM Blah WHERE ParentID = a.ID) as ChildCount
FROM Blah a

Result:

    "Seq Scan on blah a  (cost=0.00..145180063607.45 rows=2773807 width=4)"
    "  SubPlan"
    "    ->  Aggregate  (cost=52339.61..52339.63 rows=1 width=0)"
    "          ->  Seq Scan on blah  (cost=0.00..52339.59 rows=10 width=0)"
    "                Filter: (parentid = $0)"

What happens when you wrap in "select count(1)" :

EXPLAIN SELECT count(1) FROM (
SELECT ID, (SELECT COUNT(1) FROM Blah WHERE ParentID = a.ID) as ChildCount
FROM Blah a) as bar

Result:

    "Aggregate  (cost=52339.59..52339.60 rows=1 width=0)"
    "  ->  Seq Scan on blah a  (cost=0.00..45405.07 rows=2773807 width=0)"

Notice the difference?

The optimizer is smart enough to see that it doesn't need to do the subquery. So it's not that correlated subqueries are fast; it's that NOT DOING THEM is fast :-).

Unfortunately it can't do the same for a left outer join, since the number of results is not pre-determined by the first scan.

Lesson #1: The query plans tell you a hell of a lot. Poor experiment design gets you into trouble.

Lesson #1.1: If you don't need to do a join, by all means, don't.

I created a test dataset of roughly 2.7 million records.

The left outer join -- without the wrapper -- ran in 171,757 ms on my laptop.

The correlated subquery... I'll update when it finishes; I'm at 700K ms and it's still running.

Lesson #2: When someone tells you to look at the query plan, and claims it's showing an algorithmic order of difference... look at the query plan.

SquareCog
I killed the correlated subquery after it ran for 50 minutes without producing an answer. Keep in mind that the left outer join ran in a little under 3 minutes.
SquareCog
-1 Actually I checked both versions (with and without the wrapped COUNT(1)) and the durations were approximately the same. There's a difference between measuring actual results and a theoretical explain plan.
cletus
Could you post the run times and explain plans for both queries? It's possible that MySQL does something clever that Postgres, which I ran this on, does not. Highly unlikely given the nature of the two, but possible.
SquareCog
MySQL may well be less clever than SQL Server when it comes to deciding whether it needs to execute subqueries; I did check that though. I've been trying to get my SQL Server Express working and to import a large amount of data, and it's been a nightmare, but I'll post that too if/when it's ever working.
cletus
Added Oracle and SQL Server.
cletus
A: 

Your queries all assume that parent and child nodes are entered in sequential order. If a child of one of the first nodes is entered at the end and its ID or PK is higher, the query doesn't work.

Mark deraeve
A: 

Did you ever try adding an index on the parent ID column for MySQL? I'm pretty sure the execution times would improve vastly. I haven't tested it, but I would say that MySQL goes through all rows to determine each count, meaning it does 10-40 billion (number of rows in the table * 10000) lookups in those 59 seconds.

I assume that SQL Server and Oracle create an index on the fly. If they do, it would be only 1 to 4 million lookups.
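For what it's worth, the effect of such an index can be sketched with SQLite's EXPLAIN QUERY PLAN (Python stdlib sqlite3; the index name idx_parent is made up). Once an index on ParentID exists, the per-row count in the correlated subquery becomes an index search instead of a full table scan:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE Blah (ID INTEGER PRIMARY KEY, SomeText TEXT, ParentID INTEGER)")

query = """
    SELECT ID, (SELECT COUNT(1) FROM Blah WHERE ParentID = a.ID)
    FROM Blah a"""

# Without an index, the subquery scans the whole table for every outer row.
before = [row[3] for row in con.execute("EXPLAIN QUERY PLAN " + query)]
assert not any("idx_parent" in step for step in before)

con.execute("CREATE INDEX idx_parent ON Blah (ParentID)")

# With the index, the planner switches the subquery to an index lookup.
after = [row[3] for row in con.execute("EXPLAIN QUERY PLAN " + query)]
assert any("idx_parent" in step for step in after)
```

That turns one full scan per outer row into one index probe per outer row, which matches the intuition above about the lookup counts.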

gunter sammet