Is there a faster way to select the distinct count of users from a table? Perhaps using row_number, partitioning, or cross apply?

I just can't think of it right now.

Example:

Table UsageLog

UserId     Date     StoreNumber
Alice      200901   342
Alice      200902   333
Alice      200902   112
Bob        200901   112
Bob        200902   345
Charlie    200903   322

Here's my current query:

select count(distinct userID), date
from
   UsageLog
where
   date between 200901 and 200902
group by date

My actual table has millions of rows, and all of the columns are integers.

Is there a faster way to get the list of users?

Edit:

I already have a nonclustered index on each individual column. For some reason, the execution plan shows that I am still doing a table scan. I guess I should create a clustered index on Date. I'll see how that works...
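For reference, the clustered index I have in mind would be something like this (the index name is just a placeholder):

CREATE CLUSTERED INDEX IX_UsageLog_Date ON UsageLog(Date);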

A: 

Have you tried GROUP BY?

For example:

select count(userID), userID
  from UsageLog
 where date between 200901 and 200902
 group by userID

Then look at the execution plan for both queries to compare the performance.
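If this is SQL Server (the execution plans mentioned suggest it is), one quick way to compare the two queries is to turn on I/O and time statistics before running each one. A minimal sketch:

set statistics io on;    -- reports logical/physical reads per table
set statistics time on;  -- reports CPU and elapsed time
-- run both queries here and compare the Messages output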

northpole
This will give a different result than the count of distinct userIDs; it will give you the row count for each userID.
jvanderh
+2  A: 

Composite index on Date and UserId should help quite a bit
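For example, something along these lines (the index name is only illustrative):

CREATE INDEX IX_UsageLog_Date_UserId ON UsageLog(Date, UserId);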

RC
Based on his query, he would benefit from an index on Date as well. That would be the real performance gainer.
Nissan Fan
@Nissan Fan: not really. If it is not covering, the query optimizer will prefer a clustered scan instead at a surprisingly low 'tipping point'.
Remus Rusanu
Please elaborate, Remus, on what you mean. If you have a WHERE clause in SQL Server and you put a covering index in place on the fields in the clause, you stand to gain an incredible amount of performance. With no index on Date you get a full table scan.
Nissan Fan
'Covering' is the key word here; it was missing from your original comment.
Remus Rusanu
+3  A: 

Overall I have not found any way that is faster than what you have there; COUNT(DISTINCT UserId) is a pretty basic query.

The biggest thing here would be to ensure that you have an index on the table that covers both the Date column and the UserId column.

Mitchel Sellers
+1  A: 

Use GROUP BY and make sure you have an index on the UserId column.

Draemon
+1  A: 

I ran a few quick tests.

One index on (Date, UserID): the execution plan shows an index seek, but then a sort to perform the distinct, which is slow.

One index on (UserID, Date): the execution plan shows an index scan and two compute operators, which results in the lowest cost of all the scenarios that I ran.

Other scenarios, with an index on just Date or just UserID, are more expensive than the previous one.
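For reference, the two composite indexes behind those plans would be roughly as follows (names illustrative):

CREATE INDEX IX_UsageLog_Date_UserID ON UsageLog(Date, UserID);  -- index seek, but extra sort for the distinct
CREATE INDEX IX_UsageLog_UserID_Date ON UsageLog(UserID, Date);  -- index scan, no extra sort, lowest cost here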

jvanderh
+2  A: 

COUNT(DISTINCT userID) is the way to go. The problem is that you are hitting the date index's tipping point, so your plan goes for the clustered index scan instead. See the linked Kimberly L. Tripp article for what a 'tipping point' is.

You need a covering index:

CREATE INDEX idx_UsageLog_date_user_id ON UsageLog(date) INCLUDE (userID);

A clustered index will also work, but it has other side effects as well. If a clustered index on date is OK with the rest of your data access patterns, then it is better than the covering index I propose.

Update:

The reverse-order index you tried on (userID, date) also works; it will range seek each userID. In fact, it is better than (date, userID) or (date) INCLUDE (userID) because it returns the userIDs pre-sorted, so the DISTINCT does not introduce an additional sort.

Still, I recommend going over the link I posted to understand why an index on each individual column was not helping.

Remus Rusanu
Thanks for the tip. (haha, get it?)
Jeff Meatball Yang