This question originates from a discussion on whether to use SQL ranking functionality or not in a particular case.
Any common RDBMS includes some ranking functionality, i.e. it's query language has elements like TOP n ... ORDER BY key
, ROW_NUMBER() OVER (ORDER BY key)
, or ORDER BY key LIMIT n
(overview).
They do a great job in increasing performance if you want to present only a small chunk out of a huge number of records. But they also introduce a major pitfall: If key
is not unique results are non-deterministic. Consider the following example:
users
user_id name
1 John
2 Paul
3 George
4 Ringo
logins
login_id user_id login_date
1 4 2009-08-17
2 1 2009-08-18
3 2 2009-08-19
4 3 2009-08-20
A query is supposed to return the person who logged in last:
SELECT TOP 1 users.*
FROM
logins JOIN
users ON logins.user_id = users.user_id
ORDER BY logins.login_date DESC
Just as expected George
is returned and everything looks fine. But then a new record is inserted into logins
table:
1 4 2009-08-17
2 1 2009-08-18
3 2 2009-08-19
4 3 2009-08-20
5 4 2009-08-20
What does the query above return now? Ringo
? George
? You can't tell. As far as I remember e.g. MySQL 4.1 returns the first record physically created that matches the criteria, i.e. the result would be George
. But this may vary from version to version and from DBMS to DBMS. What should have been returned? One might say Ringo
since he apparently logged in last but this is pure interpretation. In my opinion both should have been returned, because you can't decide unambiguously from the data available.
So this query matches the requirements:
SELECT users.*
FROM
logins JOIN
users ON
logins.user_id = users.user_id AND
logins.login_date = (
SELECT max(logins.login_date)
FROM
logins JOIN
users ON logins.user_id = users.user_id)
As an alternative some DBMSs provide special functions (e.g. Microsoft SQL Server 2005 introduces TOP n WITH TIES ... ORDER BY key
(suggested by gbn), RANK
, and DENSE_RANK
for this very purpose).
If you search SO for e.g. ROW_NUMBER
you'll find numerous solutions which suggest using ranking functionality and miss to point out the possible problems.
Question: What advice should be given if a solution that includes ranking functionality is proposed?