ansaurus

Question

Should SQL ranking functionality be considered "use with caution"?

Answer 1

+3 A:

rank and row_number are fantastic functions that should be used more liberally, IMO. Folks just don't know about them.

That being said, you need to make sure what you're ranking by is unique. Have a backup plan for duplicates (esp. dates). The data you get back is only as good as the data you put in.

I think the pitfalls here are exactly the same as in this query:

select top 2 * from tblA order by date desc

You need to be aware of what you're ordering on and ensure that there is always some way to pick a winner. If not, you get two (potentially arbitrary) rows out of those that share the max date.
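A minimal sketch of the tie-breaker idea, run against an in-memory SQLite database from Python so it is self-contained. The `id` column and the sample rows are assumptions for illustration, SQLite's `LIMIT 2` stands in for SQL Server's `TOP 2`, and window functions require SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical table and data, modelled on the tblA query above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tblA (id INTEGER PRIMARY KEY, date TEXT)")
conn.executemany(
    "INSERT INTO tblA (id, date) VALUES (?, ?)",
    [(1, "2009-08-18"), (2, "2009-08-20"), (3, "2009-08-20"), (4, "2009-08-20")],
)

# Three rows share the max date, so ordering by date alone would let the
# engine return any two of them. The unique id as a second sort key
# always produces a winner.
top_two = conn.execute(
    "SELECT id, date FROM tblA ORDER BY date DESC, id DESC LIMIT 2"
).fetchall()

# The same tie-breaker makes ROW_NUMBER deterministic.
numbered = conn.execute(
    "SELECT id, ROW_NUMBER() OVER (ORDER BY date DESC, id DESC) AS rn "
    "FROM tblA ORDER BY rn"
).fetchall()

print(top_two)
print(numbered)
```

Whether the highest or the lowest id should win among tied dates is a business decision; the point is only that the sort key as a whole is unique.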

Also, for the record, SQL Server does not store rows in the physical order in which they were inserted. It stores records on 8 KB pages and orders those pages as efficiently as it can according to the clustered index on the table. Therefore, without an ORDER BY there is absolutely no guarantee of order in SQL Server.

Eric 2009-08-20 11:28:56

+1. Great advice.

KG 2009-10-04 05:25:08

Answer 2

+1 A:

ROW_NUMBER is a fantastic tool indeed. If misused it can produce non-deterministic results, but so can other SQL features. ORDER BY can return non-deterministic results as well.

Just know what you are doing.

Developer Art 2009-08-20 11:36:18

Well roared, lion. Originally I thought about putting "use your head first" on the considerations list. But what if you're an inexperienced programmer asking a question on SO and someone proposes a TOP .. GROUP BY solution without pointing out its dangers? You might get into trouble without even realizing...

The Chairman 2009-08-20 11:50:13

@Mao Tsetung: It's the nature of the job. Nothing is that simple or obvious. You have to learn, make mistakes, get burnt, find workarounds and thus accumulate knowledge and experience. There are no shortcuts.

Developer Art 2009-08-20 11:55:33

Answer 3

+2 A:

Use the WITH TIES clause in your example above

SELECT TOP 1 WITH TIES users.*
FROM
  logins JOIN
  users ON logins.user_id = users.user_id
ORDER BY logins.login_date DESC

Use DENSE_RANK as you mentioned

Not put myself in this position. Example: store the time too (datetime) and accept the very low risk of a duplicate within the same 3.33-millisecond instant (SQL Server 2008 is different)
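To make the first two options concrete, here is a hedged sketch using Python's sqlite3 module. SQLite has no `TOP ... WITH TIES`, so filtering on `RANK() = 1` is used here as a stand-in that returns the same rows; the sample logins are invented for illustration, and window functions need SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logins (user_id INTEGER, login_date TEXT)")
conn.executemany(
    "INSERT INTO logins VALUES (?, ?)",
    # Hypothetical data: users 1 and 2 share the latest login date.
    [(1, "2009-08-20"), (2, "2009-08-20"), (3, "2009-08-19"), (4, "2009-08-18")],
)

# RANK leaves a gap after a tie, DENSE_RANK does not.
ranked = conn.execute(
    """
    SELECT user_id,
           RANK()       OVER (ORDER BY login_date DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY login_date DESC) AS dense_rnk
    FROM logins
    ORDER BY user_id
    """
).fetchall()

# Stand-in for TOP 1 WITH TIES: keep every row tied for first place.
latest = conn.execute(
    """
    SELECT user_id
    FROM (SELECT user_id,
                 RANK() OVER (ORDER BY login_date DESC) AS rnk
          FROM logins)
    WHERE rnk = 1
    ORDER BY user_id
    """
).fetchall()

print(ranked)
print(latest)
```

Both approaches accept that the result may contain more than one row, instead of silently picking one of the tied rows.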

gbn 2009-08-20 12:02:04

+1 since I wasn't aware of `TOP n WITH TIES ... ORDER BY key`. That's another alternative. As you might have expected, I don't agree with you on the date vs. datetime question. I don't want "very low risk". I want "no risk". Yeah, I know... No risk, no fun...

The Chairman 2009-08-20 14:00:40

Answer 4

+2 A:

Every database engine uses some kind of a row identifier so that it can distinguish between two rows.

These identifiers are:

Row pointer in MyISAM
Primary key in InnoDB table with a PRIMARY KEY defined
Uniquifier in InnoDB table without a PRIMARY KEY defined
RID in SQL Server's heap table
Primary key in SQL Server's table clustered on PRIMARY/UNIQUE KEY
Index key + uniquifier in SQL Server's table clustered on a non-unique key
ROWID / UROWID in Oracle
CTID in PostgreSQL.

You don't have immediate access to the following ones:

Row pointer in MyISAM
Uniquifier in InnoDB table without a PRIMARY KEY defined
RID in SQL Server's heap table
Index key + uniquifier in SQL Server's table clustered on a non-unique key

Besides, you don't have control over the following ones:

ROWID / UROWID in Oracle
CTID in PostgreSQL.

(they can change on updates or restoring from backups)

If two rows are identical in these tables, that means they should be identical from the application's point of view.

They return exactly the same results and can be treated as an ultimate uniquifier.

This just means you should always include some kind of uniquifier that you have full control over in the ordering clause to keep your ordering consistent.

If your table has a primary or unique key (even a composite one), include it in the ordering condition:

SELECT  *
FROM    mytable
ORDER BY
        ordering_column, pk

Otherwise, include all columns in the ordering condition:

SELECT  *
FROM    mytable
ORDER BY
        ordering_column, column1, ..., columnN

The latter condition may still return any of the otherwise indistinguishable rows, but since they're indistinguishable anyway, the result will look consistent from your application's point of view.

That, by the way, is another good reason for always having a PRIMARY KEY in your tables.
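As a rough illustration of the `ORDER BY ordering_column, pk` pattern, the sketch below loads the same rows into two in-memory SQLite tables in opposite insertion orders and checks that the query result is identical. The sample rows and the `ordered` helper are hypothetical, and SQLite is only a convenient stand-in for the engines discussed above.

```python
import sqlite3

# Hypothetical rows as (pk, ordering_column); three rows tie on the value 10.
rows = [(1, 10), (2, 10), (3, 5), (4, 10)]


def ordered(insert_order):
    """Load the rows in the given order and return them sorted by (ordering_column, pk)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE mytable (pk INTEGER PRIMARY KEY, ordering_column INTEGER)"
    )
    conn.executemany("INSERT INTO mytable VALUES (?, ?)", insert_order)
    return conn.execute(
        "SELECT pk, ordering_column FROM mytable ORDER BY ordering_column, pk"
    ).fetchall()


forward = ordered(rows)
backward = ordered(list(reversed(rows)))

# The unique pk breaks every tie, so insertion order cannot affect the result.
print(forward == backward)
print(forward)
```

Because the combined sort key is unique, there is exactly one valid ordering, regardless of how the engine happens to store or scan the rows.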

But do not rely on ROWID / CTID to order rows.

These identifiers can easily change on UPDATE, so your result order will no longer be stable.

Quassnoi 2009-08-20 15:06:21

Very detailed view on the make-key-unique-advice. Thanks!

The Chairman 2009-08-20 16:41:27

Answer 5

A:

This is the summary:

Use your head first. Should be obvious, but it is always a good point to start. Do you expect n rows exactly or do you expect a possibly varying number of rows that fulfill a constraint? Reconsider your design. If you're expecting n rows exactly, your model might be designed poorly if it's impossible to identify a row unambiguously. If you expect a possibly varying number of rows, you might need to adjust your UI in order to present your query results.
Add columns to the ordering key that make it unique (e.g. the PK). You at least regain control over the returned result. There is almost always a way to do this, as Quassnoi pointed out.
Consider using possibly more suitable functions like RANK, DENSE_RANK and TOP n WITH TIES. RANK and DENSE_RANK are available in Microsoft SQL Server from the 2005 version and in PostgreSQL from 8.4 onwards; TOP n WITH TIES is SQL Server syntax. If these functions are not available, consider using nested queries with aggregation instead of ranking functions.
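The last point, nested queries with aggregation, might look roughly like the following. This is a sketch against an in-memory SQLite database with invented login data, not a query taken from the thread; it uses only `MAX`, `GROUP BY` and a join, so it does not depend on window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logins (user_id INTEGER, login_date TEXT)")
conn.executemany(
    "INSERT INTO logins VALUES (?, ?)",
    # Hypothetical data: users 1 and 2 both logged in on the latest date.
    [
        (1, "2009-08-18"), (1, "2009-08-20"),
        (2, "2009-08-19"), (2, "2009-08-20"),
        (3, "2009-08-17"),
    ],
)

# All users who logged in on the most recent date (ties included).
latest_overall = conn.execute(
    """
    SELECT user_id
    FROM logins
    WHERE login_date = (SELECT MAX(login_date) FROM logins)
    ORDER BY user_id
    """
).fetchall()

# Most recent login per user, via a grouped subquery joined back to the table.
latest_per_user = conn.execute(
    """
    SELECT l.user_id, l.login_date
    FROM logins AS l
    JOIN (SELECT user_id, MAX(login_date) AS last_login
          FROM logins
          GROUP BY user_id) AS m
      ON l.user_id = m.user_id AND l.login_date = m.last_login
    ORDER BY l.user_id
    """
).fetchall()

print(latest_overall)
print(latest_per_user)
```

Like WITH TIES, the first query returns every row that shares the maximum, so the caller has to be prepared for a varying number of rows.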

The Chairman 2009-10-25 17:59:38
