views:

28

answers:

1

I am trying to convert tables from using guid primary keys / clustered indexes to using int identities. This is for SQL Server 2005. There are two tables MainTable and RelatedTable, and the current table structure is as follows:

MainTable [40 million rows]

IDGuid - uniqueidentifier - PK
-- [data columns]

RelatedTable [400 million rows]

RelatedTableID - uniqueidentifier - PK
MainTableIDGuid - uniqueidentifier [foreign key to MainTable]
SequenceNumber - int - incrementing number per main table entry since there can be multiple entries related to a given row in the main table. These go from 1,2,3... etc for each MainTableIDGuid value.
-- [data columns]

The clustered index for MainTable is currently the primary key (IDGuid). The clustered index for RelatedTable is currently (MainTableIDGuid, SequenceNumber).

I want my conversion is do several things:<

  1. Change MainTable to use an integer ID instead of GUID
  2. Add a MainTableIDInt column to related table that links to Main Table's integer ID
  3. Change the primary key and clustered index of RelatedTable to (MainTableIDInt, SequenceNumber)
  4. Get rid of the guid columns.

I've written a script to do the following:

  1. Add an IDInt int IDENTITY column to MainTable. This does a table rebuild and generates the new identity ID values.
  2. Add a MainTableIDInt int column to RelatedTable.

The next step is to populate the RelatedTable.MainTableIDInt column for each row with its corresponding MainTable.IDInt value [based on the matching guid IDs]. This is the step I'm hung up on. I understand this is not going to be speedy, but I'd like to have it perform as well as possible.

I can write a SQL statement that does this update:

UPDATE RelatedTable
SET RelatedTable.MainTableIDInt = (SELECT MainTable.IDInt FROM MainTable WHERE MainTable.IDGuid = RelatedTable.MainTableIDGuid)

or

UPDATE RelatedTable
SET RelatedTable.MainTableIDInt = MainTable.IDInt
FROM RelatedTable 
LEFT OUTER JOIN MainTable ON RelatedTable.MainTableIDGuid = MainTable.IDGuid

The 'Display Estimated Execution Plan' displays roughly the same for both of these queries. The execution plan it spits out does the following:

  1. Clustered index scans over MainTable and RelatedTable and does a Merge Join on them [estimated number of rows = 400 million]
  2. Sorts [estimated number of rows = 400 million]
  3. Clustered index update over RelatedTable [estimated number of rows = 400 million]

I'm concerned about the performance of this [sorting 400 million rows sounds unpleasant]. Are my concerns about performance of these execution plan justified? Is there a better way to update the new ID for my related table that will scale given the size of the tables?

A: 

First, this will be a headache. Second, I wouldn't change any of the indexes or constraints until I had the data in place. I.e., I would add the identity column but not make it the primary key nor clustered index. Then I'd add the soon-to-be new foreign keys to the various tables. Your queries should look like:

Update ChildTable
Set NewIntForeignKeyId = P.NewIntPrimaryKey
From ChildTable As C
    Join ParentTable As P
        On P.PrimaryKey = C.ForeignKey

First, notice that I'm using an inner join. There is no reason to use an outer join for this type of query given that you will eventually enforce referential integrity between the new columns. Second, if you populate the columns first and then rebuild the constraints, it will be faster as you'll be able to leverage the existing indexes. Remember that when you change the clustered index, it rebuilds all of the nonclustered indexes. If the tables are large, that will be a serious hit.

Once you have the data in place, I'd then drop all primary constraints, unique constraints, foreign key constraints and unique indexes. Drop the clustered index/constraint last. I'd then add the clustered indexes to all of the tables and after that was done, recreate the unique constraints, foreign key constraints and indexes. If you do not drop the existing indexes before you recreate the clustered index, it will rebuild the existing indexes twice: once when you drop the clustered index and again when you recreate it.

Btw, I highly doubt there is a way to avoid table scans for this sort of thing since you are going to be updating every row.

Thomas