views:

2095

answers:

9

I have a sproc that puts 750K records into a temp table through a query as one of its first actions. If I create indexes on the temp table before filling it, the item takes about twice as long to run compared to when I index after filling the table. (The index is an integer in a single column, the table being indexed is just two columns each a single integer.)

This seems a little off to me, but then I don't have the firmest understanding of what goes on under the hood. Does anyone have an answer for this?

+1  A: 

this is because if the data you insert is not in the order of the index, SQL will have to split pages to make room for additional rows to keep them together logically

SQLMenace
+1  A: 

You should NEVER EVER create an index on an empty table if you are going to massively load it right afterwards. Indexes have to maintained as the data on the table changes, so imagine as if for every insert on the table the index was being recalculated (which is an expensive operation). Load the table first and create the index after finishing with the load. That's were the performance difference is going.

Iker Jimenez
+2  A: 

After performing large data manipulation operations, you frequently have to update the underlying indexes. You can do that by using the UPDATE STATISTICS [table] statement.

The other option is to drop and recreate the index which, if you are doing large data insertions, will likely perform the inserts much faster. You can even incorporate that into your stored procedure.

palehorse
+7  A: 

If you create a clustered index, it affects the way the data is physically ordered on the disk. It's better to add the index after the fact and let the database engine reorder the rows when it knows how the data is distributed.

For example, let's say you needed to build a brick wall with numbered bricks so that those with the highest number are at the bottom of the wall. It would be a difficult task if you were just handed the bricks in random order, one at a time - you wouldn't know which bricks were going to turn out to be the highest numbered, and you'd have to tear the wall down and rebuild it over and over. It would be a lot easier to handle that task if you had all the bricks lined up in front of you, and could organize your work.

That's how it is for the database engine - if you let it know about the whole job, it can be much more efficient than if you just feed it a row at a time.

Jon Galloway
+5  A: 

It's because the database server has to do calculations each and every time you insert a new row. Basically, you end up reindexing the table each time. It doesn't seem like a very expensive operation, and it's not, but when you do that many of them together, you start to see the impact. That's why you usually want to index after you've populated your rows, since it will just be a one-time cost.

tghw
+1  A: 

This due to the fact that when SQL Server indexes table with data it is able to produce exact statistics of values in indexed column. At some moments SQL Server will recalculate statistics, but when you perform massive inserts the distribution of values may change after the statistics was calculated last time.

The fact that statistics is out of date can be discovered on Query Analyzer. When you see that on a certain table scan number of rows expected differs to much from actual numbers of rows processed.

You should use UPDATE STATISTICS to recalculate distribution of values after you insert all the data. After that no performance difference should be observed.

Dima Malenko
+3  A: 

Think of it this way.

Given
unorderedList = {5, 1,3}
orderedList = {1,3,5}

add 2 to both lists.
unorderedList = {5, 1,3,2}
orderedList = {1,2,3,5}

What list do you think is easier to add to?

Btw ordering your input before load will give you a boost.

jms
+1  A: 

If you have an index on a table, as you add data to the table SQL Server will have to re-order the table to make room in the appropriate place for the new records. If you're adding a lot of data, it will have to reorder it over and over again. By creating an index only after the data is loaded, the re-order only needs to happen once.

Of course, if you are importing the records in index order it shouldn't matter so much.

Joel Coehoorn
+1  A: 

In addition to the index overhead, running each query as a transaction is a bad idea for the same reason. If you run chunks of inserts (say 100) within 1 explicit transaction, you should also see a performance increase.

Dana the Sane