I'm puzzled by the following. I have a DB with around 10 million rows, and (among other indices) there is an index on one column (campaignid_int).

Now I have 700k rows where the campaignid is 3835.

For all these rows, the connectionid is the same.

I just want to find out this connectionid.

 use messaging_db;
 SELECT     TOP (1) connectionid
 FROM         outgoing_messages WITH (NOLOCK)
 WHERE     (campaignid_int = 3835)

Now this query takes approx 30 seconds to perform!

I (with my small db knowledge) would expect that it would take any of the rows, and return me that connectionid

If I test this same query for a campaign which only has 1 entry, it goes really fast. So the index works.

How would I tackle this and why does this not work?

edit:

estimated execution plan:

select (0%) - top (0%) - clustered index scan (100%)
+1  A: 

If the campaignid_int column is not indexed, add an index to it. That should speed up the query. Right now I presume that you need to do a full table scan to find the matches for campaignid_int = 3835 before the top(1) row is returned (filtering occurs before results are returned).

EDIT: An index is already in place, but since SQL Server does a clustered index scan, the optimizer has ignored the index. This is probably due to (many) duplicate rows with the same campaignid_int value. You should consider indexing differently or query on a different column to get the connectionid you want.

Håvard S
I changed the question slightly. The campaignid_int is indexed
Toad
there is no way to trick it to just return any of the rows? I would guess that even though there are a lot of rows for this index, that it could still index to any of these rows. I just don't want it to traverse the full DB
Toad
Sorry, no. The main purpose of the index is fast lookup on unique values (i.e. keys), but when you have many duplicate values, an index seek (i.e. lookup) won't do, so the optimizer will issue an index scan. You will need to index differently or change your query.
Håvard S
A: 

Your query does not work as you expect because SQL Server keeps statistics about your index, and in this particular case it knows that there are a lot of duplicate rows with the identifier 3835. Hence it figures that it would make more sense to just do a full index (or table) scan. When you test for an ID which resolves to only one row, it uses the index as expected, i.e. it performs an index seek (the execution plan should verify this guess).
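
To see what the optimizer believes about the distribution of campaignid_int, you can look at the statistics behind the index. This is only a sketch: the index name IX_campaignid_int is an assumption (the question never names the index), so list the real names with sp_helpindex first.

```sql
-- List the indexes on the table to find the real index name.
EXEC sp_helpindex 'outgoing_messages';

-- IX_campaignid_int is a hypothetical name; substitute the one reported above.
-- The histogram shows the estimated number of rows per campaignid_int value,
-- which is what the optimizer uses when choosing between a seek and a scan.
DBCC SHOW_STATISTICS ('outgoing_messages', IX_campaignid_int) WITH HISTOGRAM;
```

If the histogram shows a very large row estimate for 3835, that would be consistent with the optimizer preferring a scan over a seek plus lookups.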

Possible solutions? Make the index composite, if you have anything to compose it with. For example, compose it with the date the message was sent (if I understand your case correctly) and then select the top 1 entry from the list with the specified id, ordered by the date. I'm not sure whether this would be better (for one, a composite index takes up more space), so it's just a guess.

EDIT: I just tried out the suggestion of making the index composite by adding a date column. If you do that and specify order by date in your query, an index seek is performed as expected.
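
A minimal sketch of that composite index, assuming the table has a date column. The column name sent_date and the index name IX_outgoing_campaign_date are made up for illustration; neither appears in the question.

```sql
-- Hypothetical names: sent_date and IX_outgoing_campaign_date are assumptions.
CREATE NONCLUSTERED INDEX IX_outgoing_campaign_date
    ON outgoing_messages (campaignid_int, sent_date);

-- Within one campaignid_int value the index entries are already sorted by
-- sent_date, so the ORDER BY can be satisfied by reading the index in order,
-- without a separate sort step.
SELECT TOP (1) connectionid
FROM outgoing_messages
WHERE campaignid_int = 3835
ORDER BY sent_date;
```

Check the actual execution plan afterwards; whether the optimizer really picks the seek depends on the statistics on your data.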

Saulius
+1  A: 

The index may be useless for 2 reasons:

  • 700k in 10 million may not be selective enough
  • and/or connectionid needs to be included so the entire query can use only the index

Otherwise, the optimiser decides it may as well use the PK/clustered index to both filter on campaignid_int and get connectionid, to avoid a bookmark lookup on 700k rows from the current index.

So, I suggest this...

CREATE NONCLUSTERED INDEX IX_Foo ON MyTable (campaignid_int) INCLUDE (connectionid)
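
Applied to the table from the question, that could look roughly like the sketch below. The index name IX_outgoing_campaign_conn is an assumption. SET STATISTICS IO is only there so you can compare logical reads before and after creating the index.

```sql
-- IX_outgoing_campaign_conn is a hypothetical index name.
-- INCLUDE stores connectionid in the index leaf pages, so the query below
-- can be answered from this index alone, without a bookmark lookup.
CREATE NONCLUSTERED INDEX IX_outgoing_campaign_conn
    ON outgoing_messages (campaignid_int)
    INCLUDE (connectionid);

SET STATISTICS IO ON;   -- report logical reads per table with the results

SELECT TOP (1) connectionid
FROM outgoing_messages
WHERE campaignid_int = 3835;

SET STATISTICS IO OFF;
```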
gbn
but since I'm specifying 'top(1)' it means: give me any row. Why would it first crawl through the 700k rows just to return one?
Toad
A: 

You aren't specifying an ORDER BY clause in your query, so the optimiser is not being told which sort order it should select the top 1 from. SQL Server won't just take a random row, it will order the rows by something and take the top 1, and it may be choosing to order by something that is sub-optimal. I would suggest that you add an ORDER BY x clause, where x is the clustered key on that table, as that will probably be the fastest.

This may not solve your problem -- in fact I'm not sure I expect it to from the statistics you've given -- but (a) it won't hurt, and (b) you'll be able to rule this out as a contributing factor.

Greg Beech
I actually tried order by recid (the primary key) and it was just as slow. =^(
Toad
I know nothing about MS-SQL-server but will "ORDER BY campaignid_int" meet the optimiser's requirement for ordering?
Adrian Pronk
@adrian: but to order it, wouldn't the db need to scan all entries to know which one comes at the top? I already know all values are the same, so it can stop at any row it finds.
Toad
@Toad: No, it shouldn't need to scan all rows. Since the WHERE clause stipulates that campaignid_int = 3835 the optimiser could know that the ORDER BY will be honoured without requiring the rows be scanned.
Adrian Pronk
A: 

This doesn't answer your question, but try using:

SET ROWCOUNT 1
SELECT     connectionid
FROM         outgoing_messages WITH (NOLOCK)
WHERE     (campaignid_int = 3835)
SET ROWCOUNT 0  -- reset, otherwise the 1-row limit stays in effect for the session

I've seen top(x) perform very badly in certain situations as well. I'm sure it's doing a full table scan. Perhaps your index on that particular column needs to be rebuilt? The above is worth a try, however.

ScottE
just as slow unfortunately
Toad
Bummer. I don't recall how we solved our query issue - if we adjusted the indices, or just took the top result in code instead of at the db level.
ScottE
+3  A: 

Hi reinier,

Due to the statistics, you should explicitly ask the optimizer to use the index you've created instead of the clustered one.

SELECT  TOP (1) connectionid
FROM    outgoing_messages WITH (NOLOCK, index(idx_connectionid))
WHERE  (campaignid_int = 3835)

I hope it will solve the issue.

Regards, Enrique

enrique
that solved it! Thanks!
Toad
A: 

but since I'm specifying 'top(1)' it means: give me any row. Why would it first crawl through the 700k rows just to return one? – reinier 30 mins ago

Sorry, can't comment yet, but the answer here is that SQL Server is not going to understand the human equivalent of "Bring me the first one you find" when it hears "Top 1". Instead of the expected "Give me any row", SQL Server goes and fetches the first of all found rows. The only time it knows which one that is, is after fetching all rows first and then discarding the rest. Very thorough, but in your case not really fast.

The main issue, as others said, is your statistics and the selectivity of your index. If you have another unique field in your table (like an identity column), then try a combined index on campaignid_int first, unique column second. As you only query on campaignid_int, it has to be the first part of the key. Sounds worth a try, as this index should have a higher selectivity, so the optimizer can use it rather than doing an index crawl.
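
A rough sketch of such a combined index, assuming recid (the primary key mentioned in a comment above) is the unique column. The index name IX_outgoing_campaign_recid is made up.

```sql
-- IX_outgoing_campaign_recid is a hypothetical name; recid is assumed to be
-- the unique column. campaignid_int comes first because it is the only
-- column the query filters on.
CREATE NONCLUSTERED INDEX IX_outgoing_campaign_recid
    ON outgoing_messages (campaignid_int, recid);

SELECT TOP (1) connectionid
FROM outgoing_messages
WHERE campaignid_int = 3835
ORDER BY recid;
```

One caveat: if recid is already the clustered key, a nonclustered index on campaignid_int alone already carries recid internally, so this may not change the plan. Compare the execution plans before and after to see whether it helps.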