We're having a problem where indexes on our tables are being ignored and SQL Server 2000 is performing table scans instead. We can force the use of indexes by using the WITH (INDEX=<index_name>) clause but would prefer not to have to do this.

As a developer I'm very familiar with SQL Server when writing T-SQL, but profiling and performance tuning isn't my strong point. I'm looking for any advice and guidance as to why this might be happening.

Update:

I should have said that we've rebuilt all indexes and updated index statistics.

The table definition for one of the culprits is as follows:

CREATE TABLE [tblinvoices]
(
    [CustomerID] [int] NOT NULL,
    [InvoiceNo] [int] NOT NULL,
    [InvoiceDate] [smalldatetime] NOT NULL,
    [InvoiceTotal] [numeric](18, 2) NOT NULL,
    [AmountPaid] [numeric](18, 2) NULL 
     CONSTRAINT [DF_tblinvoices_AmountPaid]  DEFAULT (0),
    [DateEntered] [smalldatetime] NULL 
     CONSTRAINT [DF_tblinvoices_DateEntered]  DEFAULT (getdate()),
    [PaymentRef] [varchar](110),
    [PaymentType] [varchar](10),
    [SyncStatus] [int] NULL,
    [PeriodStart] [smalldatetime] NULL,
    [DateIssued] [smalldatetime] NULL 
     CONSTRAINT [DF_tblinvoices_dateissued]  DEFAULT (getdate()),
    CONSTRAINT [PK_tblinvoices] PRIMARY KEY NONCLUSTERED 
    (
     [InvoiceNo] ASC
    ) ON [PRIMARY]
) ON [PRIMARY]

There is one other index on this table (the one we want SQL to use):

CustomerID (Non-Unique, Non-Clustered)

The following query performs a table scan instead of using the CustomerID index:

SELECT 
    CustomerID, 
    Sum(InvoiceTotal) AS SumOfInvoiceTotal, 
    Sum(AmountPaid) AS SumOfAmountPaid 
FROM tblInvoices 
WHERE CustomerID = 2112 
GROUP BY CustomerID
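For reference, the forced-index version of this query looks like the following (this assumes the nonclustered index is actually named CustomerID, per the index description above):

```sql
-- Force the optimizer to use the CustomerID index via a table hint.
-- Assumes the index itself is named CustomerID.
SELECT 
    CustomerID, 
    Sum(InvoiceTotal) AS SumOfInvoiceTotal, 
    Sum(AmountPaid) AS SumOfAmountPaid 
FROM tblInvoices WITH (INDEX=CustomerID)
WHERE CustomerID = 2112 
GROUP BY CustomerID
```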

Updated:

In answer to Autocracy's question, both of those queries perform a table scan.

Updated:

In answer to Quassnoi's question about DBCC SHOW_STATISTICS, the data is:

RANGE_HI_KEY    RANGE_ROWS    EQ_ROWS    DISTINCT_RANGE_ROWS    AVG_RANGE_ROWS
1667            246           454        8                      27.33333
2112            911           3427       16                     56.9375
2133            914           775        16                     57.125
+3  A: 

The most common reasons for indexes to be ignored are

  • Columns involved are not selective enough (the optimiser decides a table scan will be faster, because it would have to 'visit' a large number of rows)

  • A large number of columns is involved in the SELECT/GROUP BY/ORDER BY, which would require a lookup back into the clustered index (or heap) after using the nonclustered index

  • Statistics are out of date (or skewed by a large number of inserts or deletes)

Do you have a regular index maintenance job running? (It is quite common for one to be missing in a dev environment.)
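On SQL Server 2000, a minimal version of such a job might look like this (a sketch only; FULLSCAN is my assumption, and sampled statistics are often sufficient):

```sql
-- Rebuild all indexes on the table, then refresh its statistics
-- by reading every row instead of a sample.
DBCC DBREINDEX ('tblinvoices')
UPDATE STATISTICS tblinvoices WITH FULLSCAN
```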

Mitch Wheat
+1 for selectiveness - an index will really only be used if it picks roughly less than 1-5% of the rows for a given value. E.g. putting an index on a "gender" column or a "bit" column really doesn't make a whole lot of sense.
marc_s
I think the rule of thumb for SQL Server is 10% (approx.)
Mitch Wheat
There are various numbers floating around - it definitely doesn't work with 50% selectiveness! (e.g. "gender")
marc_s
Also depends on the size of the rows - see details here: http://www.sqlskills.com/BLOGS/KIMBERLY/category/Nonclustered-Indexes.aspx
marc_s
@marc_s: nice article. I really must start reading kimberly's blog again
Mitch Wheat
@Mitch - thanks for the info. I'll pass to our DBA chap.
Kev
@Marc_s: thanks for the Kimberly link.
Kev
@Mitch - tblInvoices grows by about 100-150 rows per day. It has ~260,000 rows. The bulk of the rows are added by a daily invoicing run; the remainder (say 30%) are added ad hoc during the business day. Not sure if that helps.
Kev
A: 

You could also try doing an UPDATE STATISTICS on the table (or tables) involved in the query. Not that I fully understand statistics in SQL, but I do know it is something our DBAs do occasionally (with a weekly job scheduled to update stats on the larger and frequently changed tables).

SQL Statistics

Tim C
+5  A: 

We're having a problem where indexes on our tables are being ignored and SQL Server 2000 is performing table scans instead.

Despite the 4,302 days that have passed since Aug 29, 1997, SQL Server's optimizer has not evolved into SkyNet yet, and it can still make incorrect decisions.

Index hints are just the way you, a human being, help the artificial intelligence.

If you are sure that you have collected statistics and the optimizer is still wrong, then go ahead and use the hints.

They are a legitimate, correct, documented and Microsoft-supported way to enforce the query plan you want.

In your case:

SELECT CustomerID, 
       SUM(InvoiceTotal) AS SumOfInvoiceTotal, 
       SUM(AmountPaid) AS SumOfAmountPaid 
FROM   tblInvoices 
WHERE  CustomerID = 2112 
GROUP BY
       CustomerID

, the optimizer has two choices:

  • Use the index, which implies a nested loop over the index along with a KEY LOOKUP to fetch the values of InvoiceTotal and AmountPaid
  • Do not use the index and scan all the table's rows, which is faster in rows fetched per second, but more expensive overall because far more rows are fetched.

The first method may or may not be faster than the second one.

The optimizer tries to estimate which method is faster by looking into the statistics, which keep the index selectivity along with other values.

For selective indexes, the former method is faster; for non-selective ones, the latter is.

Could you please run this query:

SELECT  1 - CAST(COUNT(NULLIF(CustomerID, 2112)) AS FLOAT) / COUNT(*)
FROM    tblInvoices

Update:

Since CustomerID = 2112 covers only 1.4% of your rows, you should benefit from using the index.

Now, could you please run the following query:

DBCC SHOW_STATISTICS ([tblinvoices], [CustomerID])

, locate the two adjacent rows in the third resultset whose RANGE_HI_KEY values are less than and greater than 2112, and post those rows here?

Update 2:

Since the statistics seem to be correct, we can only guess why the optimizer chooses full table scan in this case.

Probably this is because the value 2112 itself occurs as a RANGE_HI_KEY, and the optimizer sees that it is unusually dense (3,427 rows for 2112 alone, against only 911 for the whole range from 1668 to 2111).

Could you please do two more things:

  1. Run this query:

    DBCC SHOW_STATISTICS ([tblinvoices], [CustomerID])
    

    and post the first two resultsets.

  2. Run this query:

    SELECT  TOP 1 CustomerID, COUNT(*)
    FROM    tblinvoices
    WHERE   CustomerID BETWEEN 1668 AND 2111
    GROUP BY CustomerID
    ORDER BY COUNT(*) DESC
    

    , use the top CustomerID from the query above in your original query:

    SELECT CustomerID, 
           SUM(InvoiceTotal) AS SumOfInvoiceTotal, 
           SUM(AmountPaid) AS SumOfAmountPaid 
    FROM   tblInvoices 
    WHERE  CustomerID = @Top_Customer
    GROUP BY
           CustomerID
    

    and see what plan it generates.

Quassnoi
Index hints should be an absolute last resort, and only used in exceptional circumstances.
Mitch Wheat
It's not SkyNet, but even on 2000 it's very good. My bet would be on missing statistics
erikkallen
@Mitch: if I understand you right, the optimizer never makes mistakes?
Quassnoi
@Quassnoi: I'm not saying it is NEVER wrong; just rarely. Index Hints should be used as a last resort, IMO.
Mitch Wheat
@Mitch: you just tell me: is it OK to use the hints when the optimizer is wrong or not?
Quassnoi
@Quassnoi: we don't know for sure in this case whether the optimiser is wrong or not (and I would put my money on NOT), so index hints are not appropriate (yet)
Mitch Wheat
+1 for referencing SkyNet and for calculating how many days. Wait, you used a calendar table for that right?
esabine
@esabine: SELECT DATEDIFF(day, '1997-08-29', GETDATE())
Quassnoi
@Quassnoi: In SQL Server, Index hints should be a last resort.
Mitch Wheat
@Mitch: OK, as you say.
Quassnoi
@Quassnoi: unless of course, you know the selectivity of your data, FKs and indexes, and know that they aren't likely to change quickly, then perhaps INDEX hints are the way to go, but this is an expert (not developer) area (no offence to devs; I'm one!)
Mitch Wheat
@Quassnio: the result is - 0.014857559059556
Kev
@Quassnio: I've added this data to my question because the comments don't format that well. Appreciated, Kev.
Kev
@Quassnoi - sorry for getting the spelling of your name wrong mate :). Looks like the issue is due to a) lack of clustered index or b) covering index as in BradC's answer. We tried a covering index for now and performance has massively improved. I'm gonna leave my DBA to create a clustered index during quiet time this week. Thanks very much for all the time you spent on this. I wish I could mark everyone's answer as correct because every little bit helped. Again, much appreciated.
Kev
A: 

Try updating your statistics. These statistics are the basis for the query optimizer's decisions about whether it should use an index or not. They contain information such as cardinality and the number of rows in each table.

E.g., if the statistics have not been updated since you did a large bulk import, the optimizer may still think the table has only 10 rows in it, and not bother with an index.
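A minimal sketch against the table from the question (both statements exist in SQL Server 2000; choose whichever scope fits):

```sql
-- Refresh statistics for the one table in question...
UPDATE STATISTICS tblinvoices
GO
-- ...or refresh out-of-date statistics across the whole database.
EXEC sp_updatestats
```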

RedFilter
A: 

Are you using "SELECT * FROM ..."? This generally results in scans.

We'd need schema, indexes and sample queries to help more

gbn
+2  A: 

Latest post from Kimberly covers exactly this topic: http://www.sqlskills.com/BLOGS/KIMBERLY/post/The-Tipping-Point-Query-Answers.aspx

SQL Server uses a cost-based optimizer, and if the optimizer calculates that the cost of looking up the index keys and then looking up the clustered index to retrieve the rest of the columns is higher than the cost of scanning the table, it will scan the table instead. The 'tipping' point is actually surprisingly low.

Remus Rusanu
+1  A: 

Several others have pointed out that your database may need the index statistics updated. You may also be matching such a high percentage of the table's rows that it would be faster to read the table sequentially than to seek across the disk to find each one. SQL Server has a fancy GUI query analyzer that will tell you what the database thinks the cost of various activities is. You can open that up and see exactly what it was thinking.

We can give you more solid answers if you can give us:

Select * from tblinvoices;
Select * from tblinvoices where CustomerID = 2112;

Use that query analyzer, and update your statistics. One last hint: you can use index hints to force it to use your index if you're sure it's just being stupid after you've done everything else.

Autocracy
+2  A: 

Have you tried adding the other columns to your index? i.e. InvoiceTotal and AmountPaid.

The idea being that the query will be "covered" by the index, and won't have to refer back to the table.

Chris Needham
+1  A: 

Have you tried

exec sp_recompile tblInvoices

...just to make sure you're not using a cached bad plan?

SqlACID
+2  A: 

I would start testing to see if you can change the primary key to a clustered index. Right now the table is considered a "heap". If you can't do this then I would also consider creating a view with a clustered index but first you'd have to change the "AmountPaid" column to NOT NULL. It already defaults to zero so this might be an easy change. For the view I'd try something similar to this.

SET QUOTED_IDENTIFIER, ANSI_NULLS, ANSI_PADDING, ANSI_WARNINGS, ARITHABORT, CONCAT_NULL_YIELDS_NULL ON
GO
SET NUMERIC_ROUNDABORT OFF
GO

IF EXISTS 
  (
         SELECT TABLE_NAME
           FROM INFORMATION_SCHEMA.VIEWS 
          WHERE TABLE_NAME = N'CustomerInvoiceSummary'
  )
           DROP VIEW dbo.CustomerInvoiceSummary
GO

CREATE VIEW dbo.CustomerInvoiceSummary WITH SCHEMABINDING
AS

  SELECT a.CustomerID
       , Sum(a.InvoiceTotal) AS SumOfInvoiceTotal
       , Sum(a.AmountPaid)   AS SumOfAmountPaid 
       , COUNT_BIG(*)                     AS CT
    FROM dbo.tblInvoices a
GROUP BY a.CustomerID

GO
CREATE UNIQUE CLUSTERED INDEX CustomerInvoiceSummary_CLI ON dbo.CustomerInvoiceSummary ( CustomerID )
GO
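Once the view and its clustered index exist, a query like this can read the pre-aggregated totals directly. Note that on SQL Server 2000 editions other than Enterprise, the NOEXPAND hint is required for the optimizer to use the view's index (view and column names as defined above):

```sql
-- Read the pre-aggregated totals straight from the indexed view.
SELECT CustomerID, SumOfInvoiceTotal, SumOfAmountPaid
FROM   dbo.CustomerInvoiceSummary WITH (NOEXPAND)
WHERE  CustomerID = 2112
```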
esabine
Esabine - thanks very much for the input and time spent on this. I'm gonna go with BradC's answer as the covering index has righted a lot of wrongs for us.
Kev
That's great to hear you've got a potential solution and thanks for the feedback. Just realized too - what I failed to mention above is when you run the original query the optimizer will consider using the index on the view (CustomerInvoiceSummary_CLI), even though you don't actually query the view directly.
esabine
+2  A: 

I think I just found it. I was reading the comments posted to your question before I noted that the two queries I gave you were expected to cause table scans; I just wanted the results. That said, it caught my interest when somebody said you had no clustered index. I read your SQL CREATE statement in detail, and was surprised to note that was the case. This is why it isn't using your CustomerId index.

Your CustomerId index references your primary key of InvoiceNo. Your primary key, however, isn't clustered, so then you'd have to look in that index to find where the row actually is. The SQL server won't do two non-clustered index lookups to find a row. It'll just table scan.

Make your InvoiceNo a clustered index. We can assume those will generally be inserted in ascending manner, and thus the insertion cost won't be much higher. Your query cost, however, will be much lower. Dollars to donuts, it'll use your index then.
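A sketch of that change, using the constraint name from the table definition in the question (rebuilding the primary key as clustered rewrites the whole table, so schedule it for a quiet period):

```sql
-- Drop the existing nonclustered primary key and recreate it clustered.
ALTER TABLE tblinvoices DROP CONSTRAINT PK_tblinvoices
ALTER TABLE tblinvoices
    ADD CONSTRAINT PK_tblinvoices PRIMARY KEY CLUSTERED (InvoiceNo)
```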


Edit: I like BradC's suggestion as well. It's a common DBA trick. Like he says, though, make that primary key clustered anyway, since this is the CAUSE of your problem. It is very rare to have a table with no clustered index, and most of the time that's a bad idea. That said, his covering index is an improvement ON TOP OF the clustering that should be done.

Autocracy
+2  A: 

The best thing to do is make the index a covering index by including the InvoiceTotal and AmountPaid columns in the CustomerID index. (In SQL 2005, you would add them as "included" columns. In SQL 2000, you have to add them as additional key columns.) If you do that, I'll guarantee the query optimizer will choose your index*.

Explanation: Indexes seem like they would always be useful, but there is a hidden cost to using a (non-covering) index, and that is the "bookmark lookup" that has to be done to retrieve any other columns that might be needed from the main table. This bookmark lookup is an expensive operation, and is (one possible) reason why the query optimizer might not choose to use your index.

By including all needed columns in the index itself, this bookmark lookup is avoided entirely, and the optimizer doesn't have to play this little game of figuring out if using an index is "worth it".
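A sketch of both variants (the index name here is an assumption):

```sql
-- SQL Server 2000: every covered column must be a key column.
CREATE NONCLUSTERED INDEX IX_tblinvoices_CustomerID_Covering
    ON tblinvoices (CustomerID, InvoiceTotal, AmountPaid)

-- SQL Server 2005 and later: keep the key narrow and INCLUDE the rest.
-- CREATE NONCLUSTERED INDEX IX_tblinvoices_CustomerID_Covering
--     ON tblinvoices (CustomerID) INCLUDE (InvoiceTotal, AmountPaid)
```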

(*) Or I'll refund your StackOverflow points. Just send a self-addressed, stamped envelope to...

Edit: Yes, if your primary key is NOT a clustered index, then by all means, do that, too!! But even with that change, making your CustomerID index a covering index should increase performance by an order of magnitude (10x or better)!!

BradC
Very true. If this is a regularly run query, the space used for this larger index will pay off handsomely. Further, anything else that looks up CustomerId will still be able to use the index, so you won't have two of them wandering around. +1
Autocracy
Brad - my DBA says that having a covering index have markedly improved life for him. Will look into whether the clustered index will add any benefits.
Kev
Wish I could give more +15's to everyone who helped out here.
Kev
Glad this helped. Would love to hear what kind of performance increase this gave your query.
BradC
Last comment from DBA was that we went from 68000 reads to 101 and CPU time 1047ms to 15ms (I'm guessing gathered from SQL Profiler data - I didn't see the original execution plan data).
Kev
Hey, I said that! Not quite as well though.
Chris Needham