ansaurus

Question

Slow distinct query in SQL Server over large dataset

Answer 1

A:

Second query works on 1000000 records but the first one 200M. I think this is a big difference :)

SimaWB 2009-04-16 06:33:14

Yes, it should be a big difference. But the difference should be reversed, because the first query is using an index, the second query is doing a table scan.

Miquella 2009-04-16 06:53:59

Answer 2

+1 A:

My first thought is statistics. To find last updated:

SELECT
    name AS index_name, 
    STATS_DATE(object_id, index_id) AS statistics_update_date
FROM
    sys.indexes 
WHERE
    object_id = OBJECT_ID('MyTable');

Edit: Stats are updated when indexes are rebuilt, which I see are not maintained

My second thought is that is the index still there? The TOP query should still use an index. I've just tested on one of my tables with 57 million rows and both use the index.

gbn 2009-04-16 06:37:48

Yes, the index is there, and it is using the index. :( That was the first thing I checked. It's scanning the index, but I don't know why scanning the index's only field should take so long...

Miquella 2009-04-16 06:48:51

Answer 3

+5 A:

You do misunderstand the index. Even if it did use the index it would still do an index scan across 200M entries. This is going to take a long time, plus the time it takes to do the DISTINCT (causes a sort) and it's a bad thing to run. Seeing a DISTINCT in a query always raises a red flag and causes me to double check the query. In this case, perhaps you have a normalization issue?

Al W 2009-04-16 06:40:52

No doubt, this is partially a data normalization issue, but we're already having performance problems from normalizing the data before. We're barely staying ahead of the incoming data as it is.

Miquella 2009-04-16 06:45:43

It is an index scan, but shouldn't scanning the index (at least in this case) only hit the tree's nodes, not scan through the leafs?

Miquella 2009-04-16 06:46:35

The index could be heavily fragmented, increasing the scan time. Do you run any maintenance jobs? You should be doing that nightly with that much data. (Assuming you can schedule the time.)

beach 2009-04-16 06:52:38

Hmm... that's a good point. We don't currently do maintenance like that, but you are correct, that much data should have nightly maintenance. I'll have to try that tomorrow.

Miquella 2009-04-16 06:57:31

Exactly - the index will help if a query like "WHERE typename = xyz" is made - but for a select distinct * from ... the index won't help at all. if you have 200 M rows, all 200 M rows will have to be scanned.....

marc_s 2009-04-16 07:01:51

Answer 4

+4 A:

I doubt SQL Server will even try to use the index, it'd have to do practically the same amount of work (given the narrow table), reading all 200M rows regardless of whether it looks at the table or the index. If the index on typeName was clustered it may reduce the time taken as it shouldn't need to sort before grouping.

If the cardinality of your types is low, how about maintaining a summary table which holds the list of distinct type values? A trigger on insert/update of the main table would do a check on the summary table and insert a new record when a new type is found.

ninesided 2009-04-16 06:45:31

+1; trigger on insert is better than what I was thinking of (add a second INSERT after the main, inserting into said summary table, and catching/ignoring UNIQUE constraint violation).

kquinn 2009-04-16 06:49:26

I was thinking the exact same thing. Do the summary table if you need to run the distinct query often. You'll also need to add a DELETE trigger to clean up the table after you remove rows. Or if not a big deal, schedule a SQL Job to update the summary table nightly. (removing deleted types.)

beach 2009-04-16 06:55:17

An idea if DELETEs are involved: have the summary table have a reference count column; triggers on INSERT increment it and on DELETE decrement it. That should work pretty nicely.

kquinn 2009-04-16 06:57:21

Answer 5

+1 A:

As others have already pointed out - when you do a SELECT DISTINCT (typename) over your table, you'll end up with a full table scan no matter what.

So it's really a matter of limiting the number of rows that need to be scanned.

The question is: what do you need your DISTINCT typenames for? And how many of your 200M rows are distinct? Do you have only a handful (a few hundred at most) distinct typenames??

If so - you could have a separate table DISTINCT_TYPENAMES or something and fill those initially by doing a full table scan, and then on inserting new rows to the main table, just always check whether their typename is already in DISTINCT_TYPENAMES, and if not, add it.

That way, you'd have a separate, small table with just the distinct TypeName entries, which would be lightning fast to query and/or to display.

Marc

marc_s 2009-04-16 07:05:35

It is a index scan, not a table scan (I already verified this). It was my understanding that if the index is built correctly, it would simply scan the index, rather than the entire table.

Miquella 2009-04-16 07:10:33

It can and does scan the index instead of the table. But this is not the problem indexes are designed to solve, and so a full index scan cannot resolve this query significantly faster than a full table scan.

kquinn 2009-04-16 07:17:59

The index will still contain 200M entries......

marc_s 2009-04-16 07:27:07

And in this case, the non-clustered index will already contain two of the fields (the "typeName" and the "id" for the actual lookup), so the non-clustered index is almost the whole table anyway...

marc_s 2009-04-16 08:23:36

Answer 6

A:

I should try something like this:

SELECT typeName FROM [types] WITH (nolock)
group by typeName;

And like other i would say you need to normalize that column.

Pitming 2009-04-16 07:26:51

Answer 7

A:

An index helps you to quickly find a row. But you're asking the database to list all unique types for the entire table. An index can help with that.

You could run a nightly job which runs the query and stores it in a different table. If you require up-to-date data, you could store the last ID that was included in the nightly scan, and combine the resunts:

select type
from nightlyscan
union
select distinct type
from verybigtable
where rowid > lastscannedid

Another option is to normalize the big table into two tables:

talbe1: id, guid, typeid
type table: typeid, typename

This would be very benificient if the number of types was relatively small.

Andomar 2009-04-16 07:48:50

ansaurus

tags:

views:

answers:

Slow distinct query in SQL Server over large dataset

related questions