tags:

views:

1092

answers:

4

In MySQL, how can you select data where every row meats a certain condition? For example lets say I have a table showing when employees arrived at work, it has three fields:

CREATE TABLE ArrivalTimes
(UserID INT
,Day DATE 
,ArrivalTime TIME
);

I want to select all UserIDs of employees who have never been late (arrived 9am or earlier), what's the best way to do this?

+1  A: 

This was a nice thought, but it doesn't work.

SELECT UserID FROM ArrivalTimes WHERE MAX(ArrivalTime) <= '09:00:00' GROUP BY UserID

With this query you will get an error saying: "Invalid use of group function"

Aggregate functions like COUNT, MAX, MIN, AVG, SUM and others by definition perform their function on a set (or group of records) so the MAX(ArrivalTime) needs to be in the form of:

GROUP BY UserID HAVING MAX(ArrivalTime) <= '09:00:00'

See the answer from @Bill Karwin above.

jjclarkson
just add the from and you should be good to go.
Berek Bryan
Maybe you already realized this, since the answer mentions it doesn't work, but you can't use aggregates in a WHERE clause.
R. Bemrose
A: 

SELECT userID, MAX(ArrivalTime) as latest FROM ArrivalTimes WHERE latest <= '9:00:00' GROUP BY userID

davethegr8
You can't use aggregates in a WHERE clause.
R. Bemrose
hm. I guess you could move the max to the select, and where it like so.
davethegr8
+11  A: 

The answers from @jjclarkson and @davethegr8 are close, but you can't put aggregate functions in the WHERE clause. The WHERE clause is evaluated for each row.

You need to evaluate the MAX() expression for each group, so you need to use a HAVING clause.

Try this:

SELECT UserID 
FROM ArrivalTimes
GROUP BY UserID
HAVING MAX(ArrivalTime) <= '09:00:00';


@MBCook comments that HAVING can be slow. You're right, it might not be the absolute quickest way to produce the desired result. But the HAVING solution is the most clear. There are situations where performance has lower priority than clarity and maintainability.

I looked at the EXPLAIN output (on MySQL 5.1.30) for the HAVING solution: no indexes were used, and the extra notes said "Using temporary; Using filesort," which usually means performance will be poor.

Consider the following query:

SELECT DISTINCT a1.UserID
FROM ArrivalTimes a1
  LEFT OUTER JOIN ArrivalTimes a2 
  ON (a1.UserID = a2.UserID AND a2.ArrivalTime > '09:00:00')
WHERE a2.UserID IS NULL;

This generates an optimization plan that uses an index on UserID and says:

  • a1: "Using index; Using temporary"
  • a2: "Using where; Distinct"

Finally, the following query generates an optimization plan that appears to use indexes most effectively, and no temp tables or filesort.

SELECT DISTINCT a1.UserID
FROM ArrivalTimes a1
WHERE NOT EXISTS (SELECT * FROM ArrivalTimes a2 
                  WHERE a1.UserID = a2.UserID 
                    AND a2.ArrivalTime > '09:00:00');
  • a1: "Using where; Using index"
  • a2: "Using where"

This appears most likely to have the best performance. Admittedly, I only have four rows in my test table, so this isn't a representative test.

Bill Karwin
ugh, yeah he's right.
jjclarkson
I'm sometimes surprised how oft forgotten the SQL 'HAVING' keyword is.
brendan
Warning: HAVING can be very slow. It requires pulling all the rows (like there was no HAVING clause) then filtering them out. It's the solution to this problem, but you'll want to be aware of possible performance implications on large result sets (before the HAVING is applied).
MBCook
select * from user a where '09:00:00' >= all( select ArrivalTime from ArrivalTime b where b.UserID = a.ID); uses no group by, temporary, or ilesort, and even more clearly states our specification.
tpdi
+1  A: 

Bill Karwin suggests:

Try this:

SELECT UserID 
FROM ArrivalTimes
GROUP BY UserID
HAVING MAX(ArrivalTime) <= '09:00:00';

I looked at the EXPLAIN output (on MySQL 5.1.30) for the HAVING solution: no indexes were used, and the extra notes said "Using temporary; Using filesort," which usually means performance will be poor.

I'd submit that the following is even clearer, given that there is a user table to which ArrivalTimes.UserId is a foreign key. This selects all never-tardy users:

 select * from user a 
 where '09:00:00' 
  >= all( select ArrivalTime from ArrivalTime b where b.UserID = a.ID);

This selects any user who was ever tardy:

 select * from user a 
 where '09:00:00' 
  < any( select ArrivalTime from ArrivalTime b where b.UserID = a.ID);

This is clearer because it even more closely conforms to our English/natural language specification.

And it avoids the inefficiency of a group by; under MySql 5.0.51, it doesn't require a temporary or filesort, as Bill's does.

(Note that it does require that the constant time value be zero-padded, thus: '09:00:00'; '9:00:00' fails.)

tpdi
+1 Yes, you're right, and nicely written up.
Bill Karwin