I am trying to convert tables from using guid primary keys / clustered indexes to using int identities. This is for SQL Server 2005. There are two tables MainTable
and RelatedTable
, and the current table structure is as follows:
MainTable [40 million rows]
IDGuid - uniqueidentifier - PK
-- [data columns]
RelatedTable [400 million rows]
RelatedTableID - uniqueidentifier - PK
MainTableIDGuid - uniqueidentifier [foreign key to MainTable]
SequenceNumber - int - incrementing number per main table entry since there can be multiple entries related to a given row in the main table. These go from 1,2,3... etc for each MainTableIDGuid value.
-- [data columns]
The clustered index for MainTable
is currently the primary key (IDGuid
). The clustered index for RelatedTable
is currently (MainTableIDGuid, SequenceNumber)
.
I want my conversion is do several things:<
- Change
MainTable
to use an integer ID instead of GUID - Add a
MainTableIDInt
column to related table that links to Main Table's integer ID - Change the primary key and clustered index of
RelatedTable
to(MainTableIDInt, SequenceNumber)
- Get rid of the guid columns.
I've written a script to do the following:
- Add an
IDInt int IDENTITY
column toMainTable
. This does a table rebuild and generates the new identity ID values. - Add a
MainTableIDInt int
column toRelatedTable
.
The next step is to populate the RelatedTable.MainTableIDInt
column for each row with its corresponding MainTable.IDInt
value [based on the matching guid IDs]. This is the step I'm hung up on. I understand this is not going to be speedy, but I'd like to have it perform as well as possible.
I can write a SQL statement that does this update:
UPDATE RelatedTable
SET RelatedTable.MainTableIDInt = (SELECT MainTable.IDInt FROM MainTable WHERE MainTable.IDGuid = RelatedTable.MainTableIDGuid)
or
UPDATE RelatedTable
SET RelatedTable.MainTableIDInt = MainTable.IDInt
FROM RelatedTable
LEFT OUTER JOIN MainTable ON RelatedTable.MainTableIDGuid = MainTable.IDGuid
The 'Display Estimated Execution Plan' displays roughly the same for both of these queries. The execution plan it spits out does the following:
- Clustered index scans over
MainTable
andRelatedTable
and does a Merge Join on them [estimated number of rows = 400 million] - Sorts [estimated number of rows = 400 million]
- Clustered index update over
RelatedTable
[estimated number of rows = 400 million]
I'm concerned about the performance of this [sorting 400 million rows sounds unpleasant]. Are my concerns about performance of these execution plan justified? Is there a better way to update the new ID for my related table that will scale given the size of the tables?