views: 121

answers: 4
Hi,

When creating indexes for an SQL table, if I had an index on 2 columns in the table and I changed the index to be on 4 columns, what would be a reasonable increase in the time taken to save, say, 1 million rows?

I know that the answer to this question will vary depending on a lot of factors, such as foreign keys, other indexes, etc., but I thought I'd ask anyway. Not sure if it matters, but I am using MS SQL Server 2005.


EDIT: Ok, so here's some more information that might help get a better answer. I have a table called CostDependency. Inside this table are the following columns:

CostDependancyID as UniqueIdentifier (PK)
ParentPriceID as UniqueIdentifier (FK)
DependantPriceID as UniqueIdentifier (FK)
LocationID as UniqueIdentifier (FK)
DistributionID as UniqueIdentifier (FK)
IsValid as Bit

At the moment there is one unique index involving ParentPriceID, DependantPriceID, LocationID and DistributionID. The reason for this index is to guarantee that the combination of those four columns is unique; we are not doing any searching on these four columns together. I could, however, normalise this table into three tables:

CostDependancyID as UniqueIdentifier (PK)
ParentPriceID as UniqueIdentifier (FK)
DependantPriceID as UniqueIdentifier (FK)

Unique Index on ParentPriceID and DependantPriceID

and

ExtensionID as UniqueIdentifier (PK)
CostDependencyID (FK)
DistributionID as UniqueIdentifier (FK)

Unique Index on CostDependencyID and DistributionID

and

ID as UniqueIdentifier (PK)
ExtensionID as UniqueIdentifier (FK)
LocationID as UniqueIdentifier (FK)
IsValid as Bit

Unique Index on ExtensionID and LocationID

I am trying to work out whether normalising this table, and thus reducing the number of columns in each index, will mean speed improvements when adding a large number of rows (e.g. 1 million).
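To make the current schema concrete, here is a rough sketch of the existing table and its unique index. The types follow the description above, but the constraint and index names are made up, and the FK definitions are omitted:

    -- Rough sketch only: names of the PK/unique index are invented.
    CREATE TABLE dbo.CostDependency (
        CostDependancyID UNIQUEIDENTIFIER NOT NULL PRIMARY KEY, -- clustered by default
        ParentPriceID    UNIQUEIDENTIFIER NOT NULL,             -- FK
        DependantPriceID UNIQUEIDENTIFIER NOT NULL,             -- FK
        LocationID       UNIQUEIDENTIFIER NOT NULL,             -- FK
        DistributionID   UNIQUEIDENTIFIER NOT NULL,             -- FK
        IsValid          BIT NOT NULL
    );

    -- The unique index that enforces the four-column combination.
    CREATE UNIQUE NONCLUSTERED INDEX UQ_CostDependency_Combination
        ON dbo.CostDependency (ParentPriceID, DependantPriceID, LocationID, DistributionID);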


Thanks, Dane.

+1  A: 

It depends pretty much on whether the wider index forms a covering index for your queries (and, to a lesser extent, on the ratio of reads to writes on that table). Suggest you post your execution plan(s) for the query workload you are trying to improve.
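For illustration only (this example is not taken from the question): in SQL Server 2005 a non-clustered index can be made covering for a specific query by carrying the returned columns with INCLUDE. The index name and query below are hypothetical:

    -- Hypothetical covering index: key columns match the WHERE clause,
    -- IsValid is carried as an included (non-key) column.
    CREATE NONCLUSTERED INDEX IX_CostDependency_Parent_Location
        ON dbo.CostDependency (ParentPriceID, LocationID)
        INCLUDE (IsValid);

    DECLARE @ParentPriceID UNIQUEIDENTIFIER, @LocationID UNIQUEIDENTIFIER;

    -- This query can then be answered from the index alone,
    -- with no lookup into the base table.
    SELECT IsValid
    FROM dbo.CostDependency
    WHERE ParentPriceID = @ParentPriceID
      AND LocationID    = @LocationID;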

Mitch Wheat
A: 

The query optimizer will look at the index and determine whether it can use the leading column. If the first column is not in the query, then the index won't be used, period. If the index can be used, then it will check whether the second column can be used, and so on. If your query contains 'where A=? and C=?' and your index is on A, B, C, D, then only the 'A' column will be used in the query plan.

Adding columns to an index can sometimes be useful to avoid the database having to go from the index page to the data page. If your query is 'select D from table where a=? and b=? and c=?', then column 'D' will be returned from the index, saving you a bit of IO by not having to go to the data page.
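A minimal sketch of those two points, using a made-up table (the column names A, B, C, D are from the answer above; the table itself is hypothetical):

    -- Hypothetical table and composite index.
    CREATE TABLE dbo.T (A INT, B INT, C INT, D INT);
    CREATE NONCLUSTERED INDEX IX_T_ABCD ON dbo.T (A, B, C, D);

    -- Only the leading column A can be used for the index seek; the C = 3
    -- predicate is applied as a residual filter on the rows found via A.
    SELECT D FROM dbo.T WHERE A = 1 AND C = 3;

    -- Here every column the query touches (A, B, C in the WHERE clause and
    -- D in the SELECT list) is in the index, so no data-page access is needed.
    SELECT D FROM dbo.T WHERE A = 1 AND B = 2 AND C = 3;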

brianegge
+1  A: 

I'm a bit confused over your goals. The (post-edit) question reads as though you're trying to optimize data (row) insertion, comparing one table of six columns and a four-column compound key against a "normalized" set of three tables of three or four columns each, each of the three with a two-column compound key. Is this your issue?

My first question is: what are the effects of the "normalization" from one table to three? If you had 1M rows in the single table, how many rows are you likely to have in the three normalized ones? Normalization usually removes redundant data; does it do so here?

Inserting 1M rows into a four-column-key table will take more time than into a two-column-key table--perhaps a little, perhaps a lot (see next paragraph). However, if all else is equal, I believe that inserting 1M rows into three two-column-key tables will be slower than into the single four-column one. Testing is called for.

One thing that is certain is that if the data to be inserted is not loaded in the same order as it will be stored, it will be a LOT slower than if the data being inserted were already sorted. Multiply that by three, and you'll have a long wait indeed. The most common work-around to this problem is to drop the index, load the data, and then recreate the index (it sounds like a time-waster, but for large data sets it can be faster than inserting into an indexed table). A more abstract work-around is to load the data into an unindexed partition, (re)build the index, then switch the partition into the "live" table. Is this an option you can consider?
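A hedged sketch of the drop/load/recreate approach, using the invented index name from the question's sketch (if uniqueness is enforced by a constraint rather than an index, you would drop and re-add the constraint instead):

    -- 1) Drop the wide unique index before the bulk load.
    DROP INDEX UQ_CostDependency_Combination ON dbo.CostDependency;

    -- 2) Load the 1M rows, ideally pre-sorted in the index key order.
    --    (BULK INSERT / INSERT ... SELECT goes here.)

    -- 3) Recreate the index once the data is in place.
    CREATE UNIQUE NONCLUSTERED INDEX UQ_CostDependency_Combination
        ON dbo.CostDependency (ParentPriceID, DependantPriceID, LocationID, DistributionID);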

By and large, people are not overly concerned with performance when data is being inserted--generally they sweat over data retrieval performance. If it's not a warehouse situation, I'd be interested in knowing why insert performance is your apparent bottleneck.

Philip Kelley
The main reason for this question is that the requirements behind this table have recently changed, so that instead of inserting 500 rows in each transaction we are now inserting 50,000 rows. As well as that, the index has grown from being across 2 columns to being across 4 columns. We have noticed a huge increase in the time taken to save rows - e.g. 0.2 secs to save 500 rows compared with 15 mins to save 50,000 rows (linear scaling would predict roughly 20 seconds). We are trying to work out why this is. Could you explain a bit more about the partitioning solution?
link664
Table partitioning is a long and complex topic. Kimberly Tripp's article at http://msdn.microsoft.com/en-us/library/ms345146(SQL.90).aspx is a good place to start.
Philip Kelley
+1  A: 

With all the new info available, I'd like to suggest the following:

1) If a few of the GUID (UniqueIdentifier) columns are such that a) there are relatively few distinct values and b) relatively few new values are added after the initial load (for example, LocationID may represent a store, and we may only see a few new stores each day), it would be profitable to spin these off to separate lookup table(s) mapping GUID -> LocalId (an INT or some other small column), and use this LocalId in the main table (see the sketch after this list). ==> Doing so will greatly reduce the overall size of the main table and its associated indexes, at the cost of slightly complicating the update logic (but not its performance), because of the lookup(s) and the need to maintain the lookup table(s) with new values.

2) Unless a particularly important/frequent search case could make [good] use of a clustered index, we could make the clustered index on the main table the 4-column unique composite key itself (also in the sketch below). This avoids replicating that much data in a separate non-clustered index and, as counter-intuitive as it seems, it would save time for the initial load and for new inserts. The trick is to use a relatively low fillfactor so that node splitting and balancing etc. are infrequent. BTW, if we make the main record narrower with the use of local IDs, we can more readily afford "wasting" space in the fillfactor, and more new records will fit in this space before requiring node balancing.

3) link664 could provide an order of magnitude for the total number of records in the "main" table and the number expected with each daily/weekly/whenever scheduled update. These two parameters could confirm the validity of the approach suggested above, as well as provide hints as to the possibility of dropping the indexes (or some of them) prior to big batch inserts, as suggested by Philip Kelley. Doing so, however, would be contingent on operational considerations such as the need to continue serving searches while new data is inserted.

4) Other considerations such as SQL partitioning, storage architecture etc. can also be put to work to improve load and/or retrieval performance.
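A rough sketch of points 1) and 2) above. The lookup-table name and the fillfactor value are illustrative assumptions, and the existing GUID primary key would first have to be rebuilt as non-clustered to free up the clustered slot:

    -- 1) Map a slowly-growing GUID column to a small surrogate key.
    CREATE TABLE dbo.LocationLookup (
        LocalLocationID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
        LocationID      UNIQUEIDENTIFIER NOT NULL UNIQUE
    );
    -- The main table would then store LocalLocationID (4 bytes) instead of
    -- the 16-byte LocationID GUID, shrinking the row and every index using it.

    -- 2) Make the four-column composite key the clustered index, with a low
    --    fillfactor to leave free space for out-of-order inserts.
    CREATE UNIQUE CLUSTERED INDEX CIX_CostDependency_Combination
        ON dbo.CostDependency
           (ParentPriceID, DependantPriceID, LocationID, DistributionID)
        WITH (FILLFACTOR = 70);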

mjv
Moving the clustered index from the ID column to the 4-column composite key more than halved the time it took in most cases. Also managed to work out how to drop one of the columns out of the key, so it's much better now (e.g. it takes 30 secs rather than 1 min 50 secs to insert 900k rows). Thanks mjv!
link664