views:

306

answers:

3

Hello

I have this query:

SELECT *
FROM sample
   INNER JOIN test ON sample.sample_number = test.sample_number
   INNER JOIN result ON test.test_number = result.test_number
   WHERE sampled_date BETWEEN '2010-03-17 09:00' AND '2010-03-17 12:00'

the biggest table here is RESULT, contains 11.1M records. The left 2 tables about 1M.

this query works slowly (more than 10 minutes) and returns about 800 records. executing plan shows clustered index scan (over it's PRIMARY KEY (result.result_number, which actually doesn't take part in query)) over all 11M records. RESULT.TEST_NUMBER is a clustered primary key.

if I change 2010-03-17 09:00 to 2010-03-17 10:00 - i get about 40 records. it executes for 300ms. and plan shows index seek (over result.test_number index)

if i replace * in SELECT clause to result.test_number (covered with index) - then all become fast in first case too. this points to hdd IO issues, but doesn't clarifies changing plan.

so, any ideas?

UPDATE: sampled_date is in table sample and covered by index. other fields from this query: test.sample_number is covered by index and result.test_number too.

UPDATE 2: obviously than sql server in any reasons don't want to use index.

i did a small experiment: i remove INNER JOIN with result, select all test.test_number and after that do

SELECT * FROM RESULT WHERE TEST_NUMBER IN (...)

this, of course, works fast. but i cannot get what is the difference and why query optimizer choose such inappropriate way to select data in 1st case.

UPDATE 3: after backing up database and restoring to database with new name - both requests work fast as expected even on much more ranges...

so - are there any special commands to clean or optimize, whatever, that can be relevant to this? :-(

A: 

The first thing I would do is specify the exact columns I want, and see if the problems persists. I doubt you would need all the columns from all three tables.

It sounds like it has trouble getting all the rows out of the result table. How big is a row? Look at how big all the data in the table is and divide it by the number of rows. Right click on the table -> properties..., Storage tab.

Try putting where clause into a subquery to force it to do that first?

SELECT *
FROM 
    (SELECT * FROM sample 
     WHERE sampled_date 
     BETWEEN '2010-03-17 09:00' AND '2010-03-17 12:00') s
     INNER JOIN test ON s.sample_number = test.sample_number
     INNER JOIN result ON test.test_number = result.test_number

OR this might work better if you expect a small number of samples

SELECT *
FROM sample
   INNER JOIN test ON sample.sample_number = test.sample_number
   INNER JOIN result ON test.test_number = result.test_number
WHERE sample.sample_ID in (
    SELECT sample_ID
    FROM sample
    WHERE sampled_date BETWEEN '2010-03-17 09:00' AND '2010-03-17 12:00'
)
Robert Wagner
1. it's already covered with index. troubles in RESULT table, not in SAMPLE2. 800 vs 40 (1st and 2nd query accordingly). if i specify any field different to RESULT.TEST_NUMBER (SELECT RESULT.NAME FROM ...) for example - i get a slow query and plan with clustered index scan
zerkms
this query get the same results (slow too). but this is slow even for 10-12 hrs interval.
zerkms
Check the size of the row, see above. Also look to see if there is large amounts of data in the 9am-10am timeframe.
Robert Wagner
the last query will not help just because as i said trouble is in RESULT, not in a SAMPLE
zerkms
A: 

If you do a SELECT *, you want all the data from the table. The data for the table is in the clustered index - the leaf nodes of the clustered index are the data pages.

So if you want all of those data pages anyway, and since you're joining 1 mio. rows to 11 mio. rows (1 out of 11 isn't very selective for SQL Server), using an index to find the rows, and then do bookmark lookups into the actual data pages for each of those rows found, might just not be very efficient, and thus SQL Server uses the clustered index scan instead.

So to make a long story short: only select those rows you really need! You thus give SQL Server a chance to use an index, do a seek there, and find the necessary data.

If you only select three, four columns, then the chances that SQL Server will find and use an index that contains those columns are just so much higher than if you ask for all the data from all the tables involved.

Another option would be to try and find a way to express a subquery, using e.g. a Common Table Expression, that would grab data from the two smaller tables, and reduce that number of rows even more, and join the hopefully quite small result against the main table. If you have a small result set of only 40 or 800 results (rather than two tables with 1 mio. rows each), then SQL Server might be more inclined to use a Clustered Index Seek and do bookmark lookups on 40 or 800 rows, rather than doing a full Clustered Index Scan.

marc_s
"So if you want all of those data pages anyway, and since you're joining 1 mio. rows to 11 mio. rows" I'm joining about 20 rows (date range limits this) to the 11M rows with index. anyway this doesn't explains why changing range a little changes plan. why sql can use index for 20 rows and cannot to 800?
zerkms
"So to make a long story short: only select those rows you really need! You thus give SQL Server a chance to use an index, do a seek there, and find the necessary data." So if I enumerate all 20 fields that i need from RESULT table what changes should be done with schema?
zerkms
@zerkms: yes, the difference between selecting 20 or 800 rows can make a **huge** difference in how the query optimizer picks its plan! Maybe, up to 150 rows, doing the bookmark lookups will be faster than the clustered index scan, and then the scale tips. Also having up to date statistics can make a huge difference - try to update your statistics on the tables and indices involved as a first measure
marc_s
yep, scan on 11M is of course faster than index seeking on 800 rows... (ironic)... as i said at UPD3 yeah - i need some statistics (or something service command else) updating. but cannot google it :-(
zerkms
+1  A: 

A couple things to try:

  • Update statistics
  • Add hints to the query about what index to use (in SQL Server you might say WITH (INDEX(myindex)) after specifying a table)

EDIT: You noted that copying the database made it work, which tells me that the index statistics were out of date. You can update them with something like UPDATE STATISTICS mytable on a regular basis.

Use EXEC sp_updatestats to update the whole database.

Gabe
yep, according to UPD3 at question: some command i need in to solve this ;-)
zerkms
yep, i will try this tomorrow when come to work. thank you. and may be any same methods to update stats on whole database?
zerkms