Ok, apparently, based on the upvotes for the other answer, this needs further explanation. Example (done with MySQL because I have it handy, but the principle is universal to any SQL dialect):
CREATE TABLE Blah (
    ID INT PRIMARY KEY,
    SomeText VARCHAR(30),
    ParentID INT
);
INSERT INTO Blah VALUES (1, 'One', 0);
INSERT INTO Blah VALUES (2, 'Two', 0);
INSERT INTO Blah VALUES (3, 'Three', 1);
INSERT INTO Blah VALUES (4, 'Four', 1);
INSERT INTO Blah VALUES (5, 'Five', 4);
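So the sample data forms this little tree (ParentID 0 meaning no parent):

(root)
├── 1 'One'
│   ├── 3 'Three'
│   └── 4 'Four'
│       └── 5 'Five'
└── 2 'Two'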
Inner join version:
SELECT a.ID, a.SomeText, COUNT(1)
FROM Blah a
JOIN Blah b ON a.ID = b.ParentID
GROUP BY a.ID, a.SomeText
Wrong: it ignores parents that have no children.
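Run it against the sample data and only the two parents that actually have children come back; IDs 2, 3 and 5 disappear entirely:

ID  SomeText  COUNT(1)
1   One       2
4   Four      1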
Left outer join:
SELECT a.ID, a.SomeText, COUNT(1)
FROM Blah a
LEFT OUTER JOIN Blah b ON a.ID = b.ParentID
GROUP BY a.ID, a.SomeText
Wrong, and the reason why is somewhat subtle: COUNT(1) counts every row, including the all-NULL row the outer join produces for a parent with no children, so childless parents get a count of 1 instead of 0. COUNT(b.ID) doesn't count NULLs. So the above is wrong but this is correct:
SELECT a.ID, a.SomeText, COUNT(b.ID)
FROM Blah a
LEFT OUTER JOIN Blah b ON a.ID = b.ParentID
GROUP BY a.ID, a.SomeText
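Against the sample data this returns all five rows, with the childless parents correctly at 0 (the COUNT(1) version reports 1 for those same rows):

ID  SomeText  COUNT(b.ID)
1   One       2
2   Two       0
3   Three     0
4   Four      1
5   Five      0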
Correlated subquery:
SELECT ID, SomeText, (SELECT COUNT(1) FROM Blah WHERE ParentID = a.ID) ChildCount
FROM Blah a
Also correct: the subquery naturally yields 0 for parents with no children.
Ok, so which to use? Query plans only tell you so much. The issue of subqueries vs left joins is an old one and there's no clear answer without benchmarking. So we need some data:
<?php
ini_set('max_execution_time', 180);
$start = microtime(true);
echo "<pre>\n";

// Connect and select the scratch database
mysql_connect('localhost', 'scratch', 'scratch');
if (mysql_error()) {
    echo mysql_error();
    exit();
}
mysql_select_db('scratch');
if (mysql_error()) {
    echo mysql_error();
    exit();
}

// Build a random tree breadth-first: each node on the current level
// gets 0-3 children, which together form the next level
$count = 0;
$limit = 1000000;
$this_level = array(0);
$next_level = array();
while ($count < $limit) {
    foreach ($this_level as $parent) {
        $child_count = rand(0, 3);
        for ($i = 0; $i < $child_count; $i++) {
            $count++;
            query("INSERT INTO Blah (ID, SomeText, ParentID) VALUES ($count, 'Text $count', $parent)");
            $next_level[] = $count;
        }
    }
    // Guard against an empty level (every parent rolled 0 children),
    // which would otherwise loop forever; restart from the root
    $this_level = empty($next_level) ? array(0) : $next_level;
    $next_level = array();
}
$stop = microtime(true);
$duration = $stop - $start;
$inserttime = $duration / $count;
echo "$count users added.\n";
echo "Program ran for $duration seconds.\n";
echo "Insert time $inserttime seconds.\n";
echo "</pre>\n";

function query($query) {
    mysql_query($query);
    if (mysql_error()) {
        echo mysql_error();
        exit();
    }
}
?>
I ran out of memory (32M) during this run so I only ended up with 876,109 records, but hey, it will do. Later, when I tested Oracle and SQL Server, I took the exact same set of data and imported it into Oracle XE and SQL Server Express 2005.
Now another poster raised the issue of my using a count wrapper around the queries. He correctly pointed out that the optimizer may not execute the subqueries in that case. MySQL doesn't seem to be that smart. Oracle is. SQL Server seems to be as well.
So I'll quote two figures for each database-query combination: the first is wrapped in SELECT COUNT(1) FROM ( ... ), the second is raw.
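For concreteness, the wrapped form of the outer join query looks like this (the ChildCount alias is there because SQL Server insists that every column of a derived table be named, and x names the derived table itself, which MySQL and SQL Server both require):

SELECT COUNT(1) FROM (
    SELECT a.ID, a.SomeText, COUNT(b.ID) ChildCount
    FROM Blah a
    LEFT OUTER JOIN Blah b ON a.ID = b.ParentID
    GROUP BY a.ID, a.SomeText
) x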
Setup:
- MySQL 5.0 using PremiumSoft Navicat (LIMIT 10000 in the query);
- SQL Server Express 2005 using Microsoft SQL Server Management Studio Express;
- Oracle XE using PL/SQL Developer 7 (limited to 10,000 rows).
Left outer join:
SELECT a.ID, a.SomeText, COUNT(b.ID)
FROM Blah a
LEFT OUTER JOIN Blah b ON a.ID = b.ParentID
GROUP BY a.ID, a.SomeText
- MySQL 5.0: 51.469s / 49.907s
- SQL Server: 0s (1) / 9s (2)
- Oracle XE: 1.297s / 2.656s
(1) Virtually instantaneous (confirming the different execution path)
(2) Impressive considering it is returning all the rows, not 10,000
Just goes to show the value of a real database. Also, removing the SomeText field had a significant impact on MySQL's performance, improving it by a factor of 4-5, whereas there wasn't much difference between using LIMIT 10000 and not using it. Oracle had the limit only because PL/SQL Developer barfed when it hit 100M memory usage.
Correlated Subquery:
SELECT ID, SomeText, (SELECT COUNT(1) FROM Blah WHERE ParentID = a.ID) ChildCount
FROM Blah a
- MySQL: 8.844s / 11.10s
- SQL Server: 0s / 6s
- Oracle: 0.046s / 1.563s
So with the correlated subquery MySQL is better by a factor of 4-5, Oracle is about twice as fast and SQL Server is arguably only a little faster.
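(For the raw figures that works out to 49.907 / 11.10 ≈ 4.5 for MySQL, 2.656 / 1.563 ≈ 1.7 for Oracle and 9 / 6 = 1.5 for SQL Server; the wrapped MySQL figures give 51.469 / 8.844 ≈ 5.8.)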
The point remains: the correlated subquery version is faster in all cases.
The other advantage of correlated subqueries is that they are syntactically cleaner and easier to extend: if you want to count rows in a bunch of other tables, each count can be included as another select item cleanly and easily. For example, imagine customers with invoices, where each invoice is either unpaid, overdue or paid. With a subquery that is easy:
SELECT id,
    (SELECT COUNT(1) FROM invoices WHERE customer_id = c.id AND status = 'UNPAID') unpaid_invoices,
    (SELECT COUNT(1) FROM invoices WHERE customer_id = c.id AND status = 'OVERDUE') overdue_invoices,
    (SELECT COUNT(1) FROM invoices WHERE customer_id = c.id AND status = 'PAID') paid_invoices
FROM customers c
The aggregate version is a lot uglier.
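For comparison, here is one way to write the join version of the same query, using CASE to split a single count by status (a sketch against the same customers and invoices tables):

SELECT c.id,
    COUNT(CASE WHEN i.status = 'UNPAID' THEN 1 END) unpaid_invoices,
    COUNT(CASE WHEN i.status = 'OVERDUE' THEN 1 END) overdue_invoices,
    COUNT(CASE WHEN i.status = 'PAID' THEN 1 END) paid_invoices
FROM customers c
LEFT OUTER JOIN invoices i ON i.customer_id = c.id
GROUP BY c.id

And every additional table you want counts from means another outer join, at which point the counts start multiplying against each other, whereas the subquery version just grows by one select item per count.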
Now I'm not saying that subqueries are always superior to aggregate joins, but they are often enough that you have to test it. Depending on your data, the size of that data and your RDBMS vendor, the difference can be hugely significant.