I have a table (several, actually) that contains a lot of columns (maybe 100+). What performs best when updating rows in the table if only a few columns have changed?

  1. To build the UPDATE statement dynamically only updating the changed columns.
  2. To build a parameterized UPDATE statement containing all columns, including those that have not changed.
  3. To create a procedure that takes ALL values as parameters and updates the row.

I'm using SQL Server. There are no BLOBS in the table.

Thanks / M

+3  A: 

I would say options 2 and 3 are equivalent from a performance perspective. If you are using a PK to figure out which row to update and it is a clustered key, then I wouldn't worry about updating a column to itself. The problem with option 1 is that you are going to cause "procedure cache bloat", where many similar plans take up your plan cache because each one is a slightly different variation of the update.
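To make the cache-bloat point concrete, here is a rough Python sketch (not SQL Server code; the table and column names are made up for illustration) that counts how many distinct statement texts each option can produce. It assumes each distinct statement text ends up with its own cached plan, which is how the argument above is framed.

```python
from itertools import combinations

# Hypothetical column names, for illustration only.
COLUMNS = ["col_a", "col_b", "col_c", "col_d"]


def update_sql(table, key_column, columns):
    """Return a parameterized UPDATE statement text for the given columns."""
    assignments = ", ".join(f"{c} = ?" for c in columns)
    return f"UPDATE {table} SET {assignments} WHERE {key_column} = ?"


# Option 1: one statement text per distinct non-empty subset of changed columns.
option1_texts = {
    update_sql("my_table", "id", subset)
    for size in range(1, len(COLUMNS) + 1)
    for subset in combinations(COLUMNS, size)
}

# Option 2: always the same text, no matter which columns changed.
option2_texts = {update_sql("my_table", "id", COLUMNS)}

print(len(option1_texts), len(option2_texts))  # 15 1
```

With 4 columns that is already 15 possible texts for option 1 (2^4 - 1); with 100+ columns the upper bound is far larger, although in practice only the combinations that actually occur would be cached.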

If you plan on doing massive updates, I might hesitate to recommend updating all columns, since it may cause FK look-ups, etc.

Thanks, Eric

Strommy
Will the update cause a FK-lookup even if the value of the FK doesn't change?
Mikael Sundberg
I think it won't if you set a column to itself, but I'm not sure. Let me check and get back to you.
Strommy
It appears lookups occur even if you update to the same value. As such, doing a large number of updates on a table with FKs pointing to other tables may cause scans or seeks on those tables. (Also a good reason to make sure all of your PK/FK combos are indexed.)
Strommy
A: 

I'd vote for option 1 mixed with option 2, i.e. dynamically build a parameterized UPDATE statement that updates only the changed columns. This works when your workload is mostly reads and updates are infrequent, so you can safely trade query plan caching for (physical) update performance.
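A minimal sketch of that approach in Python, assuming the client keeps both the originally loaded row and the edited row as dictionaries with the same keys. The function name `build_update`, the `?` placeholder style, and the table and column names are illustrative assumptions, not part of any particular driver's API.

```python
def build_update(table, key_column, key_value, original, current):
    """Build (sql, params) that updates only the columns whose values differ.

    Column and table names are interpolated into the SQL text, so they must
    come from a trusted list (identifiers cannot be passed as parameters).
    Returns None when nothing has changed.
    """
    # Sort the names so the same set of changed columns always yields
    # the same statement text.
    changed = sorted(c for c in current if current[c] != original[c])
    if not changed:
        return None
    assignments = ", ".join(f"{c} = ?" for c in changed)
    sql = f"UPDATE {table} SET {assignments} WHERE {key_column} = ?"
    params = [current[c] for c in changed] + [key_value]
    return sql, params


# Hypothetical row data, for illustration only.
original = {"name": "Ann", "city": "Oslo", "age": 30}
current = {"name": "Ann", "city": "Bergen", "age": 31}

sql, params = build_update("person", "id", 7, original, current)
print(sql)     # UPDATE person SET age = ?, city = ? WHERE id = ?
print(params)  # [31, 'Bergen', 7]
```

Values still travel as parameters, so only the column list varies between statements.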

AlexS
+1  A: 

Options 2 and 3 require more data to be transmitted to the server on each update, and thus have a bigger communication overhead for the data alone.

Does each row have a different set of updated columns, or is the set of columns updated the same for any given run (but the list might vary from run to run)?

In the latter case (same set of columns updated on a given run), option 1 is likely to perform better; the statement will be prepared once and used many times with a minimum of data transferred to the server for each update.

In the former case, I would look to see whether there is a relatively small subset of the columns that are changed (say 10 columns that are changed in different rows, even if any one row only changes up to 3 of those 10). In that case, I'd probably parameterize for the 10 columns, accepting the relatively small overhead of transmitting 7-9 column values that have not changed for the convenience of a single prepared statement. If the set of updated columns is all over the map (say more than 50 of the 100 columns are updated over the entire operation), then it is probably simpler just to deal with the whole lot.
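The "parameterize for the small subset" idea could look roughly like the following Python sketch. It uses the standard-library `sqlite3` module purely as a stand-in so the example runs anywhere; the `item` table, its columns, and the helper `params_for` are hypothetical, and a real SQL Server client would use its own driver and connection.

```python
import sqlite3

# Hypothetical subset of columns that ever change during the run.
UPDATABLE = ["price", "stock", "status"]

# One statement text for the whole run, covering the full subset.
SQL = (
    "UPDATE item SET "
    + ", ".join(f"{c} = ?" for c in UPDATABLE)
    + " WHERE id = ?"
)


def params_for(row_id, current_row, changes):
    """Fill every slot: the new value if changed, otherwise the current value."""
    merged = {**current_row, **changes}
    return [merged[c] for c in UPDATABLE] + [row_id]


conn = sqlite3.connect(":memory:")  # stand-in for the real database connection
conn.execute(
    "CREATE TABLE item (id INTEGER PRIMARY KEY, price REAL, stock INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO item VALUES (?, ?, ?, ?)",
    [(1, 9.5, 10, "new"), (2, 20.0, 0, "sold")],
)

# Current values of the updatable columns, keyed by row id.
current = {
    row[0]: dict(zip(UPDATABLE, row[1:]))
    for row in conn.execute("SELECT id, price, stock, status FROM item")
}

# Each row changes a different part of the subset.
pending = {1: {"stock": 8}, 2: {"price": 18.0, "status": "open"}}

conn.executemany(SQL, [params_for(i, current[i], ch) for i, ch in pending.items()])
conn.commit()

rows = conn.execute("SELECT id, price, stock, status FROM item ORDER BY id").fetchall()
print(rows)  # [(1, 9.5, 8, 'new'), (2, 18.0, 0, 'open')]
```

One caveat with this design: unchanged columns are written back with the values the client last read, so a concurrent change to those columns made by someone else in the meantime would be overwritten.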

To some extent, it depends on how easy your host language (client API) makes it to handle the various possible ways of parameterizing the updates.

Jonathan Leffler