tags:

views:

1217

answers:

7

We're using SQL Server 2005 to track a fair amount of constantly incoming data (5-15 updates per second). We noticed after it has been in production for a couple months that one of the tables has started to take an obscene amount of time to query.

The table has 3 columns:

  • id -- autonumber (clustered)
  • typeUUID -- GUID generated before the insert happens; used to group the types together
  • typeName -- The type name (duh...)

One of the queries we run is a distinct on the typeName field:

SELECT DISTINCT [typeName] FROM [types] WITH (nolock);

The typeName field has a non-clusted, non-unique ascending index on it. The table contains approximately 200M records at the moment. When we run this query, the query took 5m 58s to return! Perhaps we're not understanding how the indexes work... But I didn't think we mis-understood them that much.

To test this a little further, we ran the following query:

SELECT DISTINCT [typeName] FROM (SELECT TOP 1000000 [typeName] FROM [types] WITH (nolock)) AS [subtbl]

This query returns in about 10 seconds, as I would expect, it's scanning the table.

Is there something we're missing here? Why does the first query take so long?

Edit: Ah, my apologies, the first query returns 76 records, thank you ninesided.

Follow up: Thank you all for your answers, it makes more sense to me now (I don't know why it didn't before...). Without an index, it's doing a table scan across 200M rows, with an index, it's doing an index scan across 200M rows...

SQL Server does prefer the index, and it does give a little bit of a performance boost, but nothing to be excited about. Rebuilding the index did take the query time down to just over 3m instead of 6m, an improvement, but not enough. I'm just going to recommend to my boss that we normalize the table structure.

Once again, thank you all for your help!!

A: 

Second query works on 1000000 records but the first one 200M. I think this is a big difference :)

SimaWB
Yes, it should be a big difference. But the difference should be reversed, because the first query is using an index, the second query is doing a table scan.
Miquella
+1  A: 

My first thought is statistics. To find last updated:

SELECT
    name AS index_name, 
    STATS_DATE(object_id, index_id) AS statistics_update_date
FROM
    sys.indexes 
WHERE
    object_id = OBJECT_ID('MyTable');

Edit: Stats are updated when indexes are rebuilt, which I see are not maintained

My second thought is that is the index still there? The TOP query should still use an index. I've just tested on one of my tables with 57 million rows and both use the index.

gbn
Yes, the index is there, and it is using the index. :( That was the first thing I checked. It's scanning the index, but I don't know why scanning the index's only field should take so long...
Miquella
+5  A: 

You do misunderstand the index. Even if it did use the index it would still do an index scan across 200M entries. This is going to take a long time, plus the time it takes to do the DISTINCT (causes a sort) and it's a bad thing to run. Seeing a DISTINCT in a query always raises a red flag and causes me to double check the query. In this case, perhaps you have a normalization issue?

Al W
No doubt, this is partially a data normalization issue, but we're already having performance problems from normalizing the data before. We're barely staying ahead of the incoming data as it is.
Miquella
It is an index scan, but shouldn't scanning the index (at least in this case) only hit the tree's nodes, not scan through the leafs?
Miquella
The index could be heavily fragmented, increasing the scan time. Do you run any maintenance jobs? You should be doing that nightly with that much data. (Assuming you can schedule the time.)
beach
Hmm... that's a good point. We don't currently do maintenance like that, but you are correct, that much data should have nightly maintenance. I'll have to try that tomorrow.
Miquella
Exactly - the index will help if a query like "WHERE typename = xyz" is made - but for a select distinct * from ... the index won't help at all. if you have 200 M rows, all 200 M rows will have to be scanned.....
marc_s
+4  A: 

I doubt SQL Server will even try to use the index, it'd have to do practically the same amount of work (given the narrow table), reading all 200M rows regardless of whether it looks at the table or the index. If the index on typeName was clustered it may reduce the time taken as it shouldn't need to sort before grouping.

If the cardinality of your types is low, how about maintaining a summary table which holds the list of distinct type values? A trigger on insert/update of the main table would do a check on the summary table and insert a new record when a new type is found.

ninesided
+1; trigger on insert is better than what I was thinking of (add a second INSERT after the main, inserting into said summary table, and catching/ignoring UNIQUE constraint violation).
kquinn
I was thinking the exact same thing. Do the summary table if you need to run the distinct query often. You'll also need to add a DELETE trigger to clean up the table after you remove rows. Or if not a big deal, schedule a SQL Job to update the summary table nightly. (removing deleted types.)
beach
An idea if DELETEs are involved: have the summary table have a reference count column; triggers on INSERT increment it and on DELETE decrement it. That should work pretty nicely.
kquinn
+1  A: 

As others have already pointed out - when you do a SELECT DISTINCT (typename) over your table, you'll end up with a full table scan no matter what.

So it's really a matter of limiting the number of rows that need to be scanned.

The question is: what do you need your DISTINCT typenames for? And how many of your 200M rows are distinct? Do you have only a handful (a few hundred at most) distinct typenames??

If so - you could have a separate table DISTINCT_TYPENAMES or something and fill those initially by doing a full table scan, and then on inserting new rows to the main table, just always check whether their typename is already in DISTINCT_TYPENAMES, and if not, add it.

That way, you'd have a separate, small table with just the distinct TypeName entries, which would be lightning fast to query and/or to display.

Marc

marc_s
It is a index scan, not a table scan (I already verified this). It was my understanding that if the index is built correctly, it would simply scan the index, rather than the entire table.
Miquella
It can and does scan the index instead of the table. But this is not the problem indexes are designed to solve, and so a full index scan cannot resolve this query significantly faster than a full table scan.
kquinn
The index will still contain 200M entries......
marc_s
And in this case, the non-clustered index will already contain two of the fields (the "typeName" and the "id" for the actual lookup), so the non-clustered index is almost the whole table anyway...
marc_s
A: 

I should try something like this:

SELECT typeName FROM [types] WITH (nolock)
group by typeName;

And like other i would say you need to normalize that column.

Pitming
A: 

An index helps you to quickly find a row. But you're asking the database to list all unique types for the entire table. An index can help with that.

You could run a nightly job which runs the query and stores it in a different table. If you require up-to-date data, you could store the last ID that was included in the nightly scan, and combine the resunts:

select type
from nightlyscan
union
select distinct type
from verybigtable
where rowid > lastscannedid

Another option is to normalize the big table into two tables:

talbe1: id, guid, typeid
type table: typeid, typename

This would be very benificient if the number of types was relatively small.

Andomar