My two questions are:

  • Can I use clustered indexes to speed up bulk inserts in big tables?
  • Can I then still efficiently use foreign key relationships if my IDENTITY column is not the clustered index anymore?

To elaborate, I have a database with a couple of very big tables (between 100 and 1000 million rows) containing company data. Typically there is data for 20-40 companies in such a table, each as its own "chunk" marked by "CompanyIdentifier" (INT). Also, every company has about 20 departments, each with its own "subchunk" marked by "DepartmentIdentifier" (INT).
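
For reference, a minimal sketch of what such a table might look like (everything except CompanyIdentifier and DepartmentIdentifier is a made-up illustration):

```sql
-- Hypothetical, simplified version of one of the big tables.
-- Only CompanyIdentifier and DepartmentIdentifier come from the description above;
-- the table name and remaining columns are illustrative.
CREATE TABLE dbo.CompanyData
(
    Id                   INT IDENTITY(1,1) NOT NULL,
    CompanyIdentifier    INT NOT NULL,      -- marks the "chunk"
    DepartmentIdentifier INT NOT NULL,      -- marks the "subchunk"
    MeasureDate          DATETIME NOT NULL,
    MeasureValue         DECIMAL(18,4) NULL
);
```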

It frequently happens that a whole "chunk" or "subchunk" is added or removed from the table. My first thought was to use Table Partitioning on those chunks, but since I am using SQL Server 2008 Standard Edition I am not entitled to it. Still, most queries I have are executed on a "chunk" or "subchunk" rather than on the table as a whole.

I have been working to optimize these tables for the following functions:

  1. Queries that are run on subchunks
  2. "Benchmarking" queries that are run on the table as a whole
  3. Inserting/removing big chunks of data.

For 1) and 2) I haven't encountered a lot of problems. I have created several indexes on key fields (also containing CompanyIdentifier and DepartmentIdentifier where useful) and the queries are running fine.
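
For example (continuing the hypothetical table and column names from the sketch above), the kind of index that helps queries on a chunk or subchunk might look like:

```sql
-- A nonclustered index keyed on the chunk/subchunk columns, as described above.
-- The included column is a placeholder for whatever fields the queries actually read.
CREATE NONCLUSTERED INDEX IX_CompanyData_Company_Department
    ON dbo.CompanyData (CompanyIdentifier, DepartmentIdentifier, MeasureDate)
    INCLUDE (MeasureValue);
```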

But for 3) I have struggled to find a good solution. My first strategy was to always disable the indexes, bulk insert a big chunk and rebuild the indexes. This was very fast in the beginning, but now that there are a lot of companies in the database, it takes a very long time to rebuild the indexes each time.
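
The disable / bulk insert / rebuild pattern looked roughly like this (reusing the hypothetical names from above; the file path is a placeholder):

```sql
-- Disable the nonclustered index before the load...
ALTER INDEX IX_CompanyData_Company_Department ON dbo.CompanyData DISABLE;

-- ...bulk insert the new chunk (file path is a placeholder)...
BULK INSERT dbo.CompanyData
FROM 'D:\loads\company_42.dat'
WITH (TABLOCK, BATCHSIZE = 100000);

-- ...then rebuild, which is the step that has become slow as the table grew.
ALTER INDEX IX_CompanyData_Company_Department ON dbo.CompanyData REBUILD;
```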

At the moment my strategy has changed to just leaving the index on while inserting, since this seems to be faster now. But I want to optimize the insert speed even further.

I seem to have noticed that adding a clustered index defined on CompanyIdentifier + DepartmentIdentifier makes loading new "chunks" into the table faster. I had earlier abandoned this strategy in favour of a clustered index on an IDENTITY column, as several articles pointed out to me that the clustered index key is contained in all other indexes and so the clustered index should be as small as possible. But now I am thinking of reviving this old strategy to speed up the inserts. My question is: would this be wise, or will I suffer performance hits in other areas? And will this really speed up my inserts, or is that just my imagination?
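
Concretely, the clustered index I am considering would be something along these lines (again using the hypothetical names from above; the existing clustered IDENTITY index would first have to be dropped or made nonclustered):

```sql
-- Cluster on the chunk/subchunk columns so that each company/department "chunk"
-- occupies a contiguous range of pages; adding the IDENTITY column at the end
-- keeps the key unique so SQL Server does not need a hidden uniquifier.
CREATE CLUSTERED INDEX CIX_CompanyData_Company_Department
    ON dbo.CompanyData (CompanyIdentifier, DepartmentIdentifier, Id);
```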

I am also not sure whether in my case an IDENTITY column is really needed. I would like to be able to establish foreign key relationships with other tables, but can I also use something like a CompanyIdentifier+DepartmentIdentifier+[uniquifier] scheme for that? Or does it have to be a table-wide, fragmented IDENTITY number?
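
In other words, something like the following, where a child table references the composite key instead of a single IDENTITY column (all names are hypothetical, and the composite key needs a unique constraint before it can be referenced):

```sql
-- Parent side: a unique constraint on the composite key so it can be referenced.
ALTER TABLE dbo.CompanyData
    ADD CONSTRAINT UQ_CompanyData_Company_Department_Id
    UNIQUE (CompanyIdentifier, DepartmentIdentifier, Id);

-- Child side: a foreign key on the composite columns instead of a single IDENTITY value.
CREATE TABLE dbo.CompanyDataDetail
(
    DetailId             INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    CompanyIdentifier    INT NOT NULL,
    DepartmentIdentifier INT NOT NULL,
    CompanyDataId        INT NOT NULL,
    CONSTRAINT FK_CompanyDataDetail_CompanyData
        FOREIGN KEY (CompanyIdentifier, DepartmentIdentifier, CompanyDataId)
        REFERENCES dbo.CompanyData (CompanyIdentifier, DepartmentIdentifier, Id)
);
```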

Thanks a lot for any suggestions or explanations.

+1  A: 

A clustered index is a physical index, a physical data structure, a row order. If you insert in the middle of the clustered index, the data will be physically inserted in the middle of the existing data. I would expect a serious performance hit in this case. I only know this from theory; I have never done it in practice, because according to that theory it would be a mistake.

Therefore, I only use (and advise the use of) clustered indexes on fields whose values are always, physically, inserted at the end, preserving the order.

A clustered index can be placed on a datetime field which marks the moment of insertion or something like that, because physically the rows remain ordered as they are appended. An IDENTITY column is also a good clustered index key, but not always relevant for querying.
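
For example, a minimal sketch of that pattern (table and column names are made up):

```sql
-- A table clustered on an ever-increasing datetime: new rows always land
-- at the end of the clustered index, so inserts stay append-only.
CREATE TABLE dbo.EventLog
(
    InsertedAt DATETIME NOT NULL DEFAULT (GETDATE()),
    Payload    NVARCHAR(400) NULL
);

CREATE CLUSTERED INDEX CIX_EventLog_InsertedAt
    ON dbo.EventLog (InsertedAt);

-- An IDENTITY column behaves the same way and is unique as well:
-- CREATE TABLE dbo.Events (Id INT IDENTITY(1,1) PRIMARY KEY CLUSTERED, ...);
```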

In your solution you place a [uniquifier] field, but why do this when you can use an identity that will do just that? It will be unique, physically ordered, and small (which, for foreign keys in other tables, means a smaller index), and in some cases faster.

Can't you try this, experiment? I have a similar situation here, where I have 4 billion rows and more are constantly being inserted (up to 100 per second); the table has no primary key and no clustered index, so the propositions in this topic are VERY interesting for me too.

Alexander
Clustered indexes are only in physical order if there is zero fragmentation. It is possible for the logical and physical order to be different.
Martin Smith
Thank you for that correction. Am I enforcing lack of fragmentation by using these methods? How bad is fragmentation anyway?
Alexander
+1  A: 

Can I use clustered indexes to speed up bulk inserts in big tables?

Never! Imagine another million rows that you need to put into that table and keep physically ordered; it is a colossal loss in performance in the long run.

Can I then still efficiently use foreign key relationships if my IDENTITY column is not the clustered index anymore?

Absolutely. By the way, a clustered index is no silver bullet and may be slower than an ordinary index.
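
As a sketch (all names made up), the IDENTITY primary key can simply be declared NONCLUSTERED and still be the target of foreign keys:

```sql
-- The IDENTITY primary key is declared NONCLUSTERED, so foreign keys can
-- still reference it while the clustered index (if any) is defined elsewhere.
CREATE TABLE dbo.BigTable
(
    Id                INT IDENTITY(1,1) NOT NULL,
    CompanyIdentifier INT NOT NULL,
    CONSTRAINT PK_BigTable PRIMARY KEY NONCLUSTERED (Id)
);

CREATE TABLE dbo.OtherTable
(
    OtherId    INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    BigTableId INT NOT NULL
        CONSTRAINT FK_OtherTable_BigTable REFERENCES dbo.BigTable (Id)
);
```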

Denis Valeev
Do you disagree with [this statement](http://www.simple-talk.com/sql/database-administration/brads-sure-guide-to-indexes/)? "It is faster to insert rows into a table with a clustered index as the primary key than it is to insert the same data into a heap that has a non-clustered index as its primary key. This is true whether or not the primary key is monotonically increasing or not."
Martin Smith
Why *wouldn't* I want to have them physically ordered on Company/Department? I am only adding chunks based on that combination, I will not be adding mixed sets (ever). So to add them I would only have to touch one physical part of the hard disk. Furthermore, I have a lot of queries that run only on a unique subset based on those columns.
littlegreen
@Martin Smith, @littlegreen I'm talking about data warehousing, where you have to deal with millions of records that have to get inserted among existing data if there's a clustered index on that table. When you remove that clustered index, the data gets appended to the end of the table, which is obviously faster.
Denis Valeev
So you would remove the clustered index altogether, not even on the IDENTITY column?
littlegreen
@littlegreen I would, and an existing clustered index really defeats the purpose of this bulk insert operation.
Denis Valeev
A: 

I've been playing around with some ETL stuff lately. I went through just regularly inserting into the table, then removing and re-adding indexes before and after the insert, tried MERGE statements, and then I finally tried SSIS. I'm sold on SSIS. Just yesterday I managed to cut an ETL process (~24 million records, ~6 GB) from ~1-1.5 hours per run to ~24 minutes, just by letting SSIS handle the inserts.

I believe with Advanced Services you should be able to use SSIS.

DForck42
As far as I know, SSIS is not faster than doing a BULK INSERT operation.
littlegreen
Cool. I haven't messed around with BULK INSERT.
DForck42
A: 

Well, I've put it to the test, and putting a clustered index on the two "chunk-defining" columns increases the performance of my table.

Inserting a chunk is now relatively fast compared to the situation where I had a clustered IDENTITY key, and about as fast as when I did not have any clustered index at all. Deleting a chunk is faster than in either of those situations.

I think the fact that all the records I want to delete or insert are guaranteed to be together on a certain part of the hard disk makes these operations faster - it seems logical to me.
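
For the record, deleting a chunk in that test is just a range delete on the leading clustered key column (names as in the hypothetical sketches earlier):

```sql
-- Deleting one "chunk": with the clustered index on (CompanyIdentifier,
-- DepartmentIdentifier, ...), this touches a single contiguous range of pages.
DELETE FROM dbo.CompanyData
WHERE CompanyIdentifier = 42;
```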

littlegreen
A: 

Have a look at the System.Data.SqlClient.SqlBulkCopy API. Given your requirement to move significant numbers of rows in and out of the database, it might be what you need.

Bulk copy streams the data into the table in a single operation and then performs the index check once. I use it to copy 500,000 rows in and out of a database table and its performance is an order of magnitude better than any other technique I've tried, assuming that your application can be structured to make use of the API.

Spence
As far as I know, both SSIS and the BULK INSERT operation make use of the same technique as this API. I am using SSIS now when reading from files, and when copying between tables I just use normal SQL. Can this API copy between tables?
littlegreen