tags:

views:

258

answers:

3

Hi,

I have a large table that i am working on and for nearly all of the columns i need to use the replace statement to remove single and double quotes. The code looks like this:

SET QUOTED_IDENTIFIER ON

Update dbo.transactions set transaction_name1 = Replace(transaction_name1,'''','')
Update dbo.transactions set transaction_name2 = Replace(transaction_name2,'''','')
Update dbo.transactions set transaction_name3 = Replace(transaction_name3,'''','')
Update dbo.transactions set transaction_name4 = Replace(transaction_name4,'''','')
Update dbo.transactions set transaction_name5 = Replace(transaction_name5,'''','')

I have not put an index on the table as was not sure exactly what column would be any good being that i'm updating nearly all the columns. If i sorted the table asc by the primary key would that help increase performance?

Over then that the statements have been running for over 2 hours with no error messages and wondered if there is a solution to this performance issue, besides the usual more hardware changes? If someone could advise on ways of increasing performance of the script.

Cheers, Peter

+3  A: 

You can make this a single UPDATE statement:

UPDATE transactions SET
  transaction_name1 = Replace(transaction_name1,'''',''),
  transaction_name2 = Replace(transaction_name2,'''','')
  ... (and so on)

That would likely improve the performance by something approaching a factor of 5.


Edit:

Since this is a one shot thing on a huge dataset (90MM rows), I suggest adding in a where clause and running it in batches.

If your transactions have a primary key, partition the updates on that, doing maybe 500k at once.

Do this in a loop with explicit transactions to keep your log use to a minimum:

DECLARE @BaseID INT, @BatchSize INT
SELECT @BaseID = MAX(YourKey), @BatchSize = 500000 FROM transactions

WHILE @BaseID > 0 BEGIN

  PRINT 'Updating from ' + CAST(@BaseID AS VARCHAR(20))

  -- perform update
  UPDATE transactions SET
    transaction_name1 = Replace(transaction_name1,'''',''),
    transaction_name2 = Replace(transaction_name2,'''','')
    -- ... (and so on)
  WHERE YourKey BETWEEN @BaseID - @BatchSize AND @BaseID

  SET @BaseID = @BaseID - @BatchSize - 1

END

Another note:

If the quotes must not appear in your data, you can create a check constraint to keep them out. It's a last ditch effort as any app attempting to put them in would need to handle a database exception, but it will keep your data clean. Something like this might do it:

ALTER TABLE transactions
  ADD CONSTRAINT CK_NoQuotes CHECK(
    CHARINDEX('''',transaction_name1)=0 AND
    CHARINDEX('''',transaction_name2)=0 AND
    -- and so on...
  )
Michael Haren
Correct the replace function will be used only once and no one has access to the table so no chance of a lock being caused on the table etc. The amount of rows is massive - approx 90 million rows and there are lots of text fields so i know it will take a long time to process. Looking really for advice on improving the code to ensure it runs as best as possible.Cheers
Peter
DaveE
+2  A: 

You can combine the statements, that might be a bit faster:

SET QUOTED_IDENTIFIER ON

UPDATE dbo.transactions
SET transaction_name1 = REPLACE(transaction_name1,'''',''),
    transaction_name2 = REPLACE(transaction_name2,'''',''),
    transaction_name3 = REPLACE(transaction_name3,'''',''),
    transaction_name4 = REPLACE(transaction_name4,'''',''),
    transaction_name5 = REPLACE(transaction_name5,'''','')

Also, check out the estimated execution plan. It might give you a useful advice in how to optimize your database / query (it's a little square button in the button bar of SQL Management Studio).

Zyphrax
Thank you i will have a look at that.
Peter
+1  A: 

You might try making this only a single UPDATE and only updating the rows that need it:

UPDATE dbo.transactions
SET transaction_name1 = REPLACE(transaction_name1,'''',''),
    transaction_name2 = REPLACE(transaction_name2,'''',''),
    transaction_name3 = REPLACE(transaction_name3,'''',''),
    transaction_name4 = REPLACE(transaction_name4,'''',''),
    transaction_name5 = REPLACE(transaction_name5,'''','')
WHERE 
    CHARINDEX('''',transaction_name1)>0
    OR CHARINDEX('''',transaction_name2)>0
    OR CHARINDEX('''',transaction_name3)>0
    OR CHARINDEX('''',transaction_name4)>0
    OR CHARINDEX('''',transaction_name5)>0
KM
I will give it a go and see if the where clause improves the performance. The only reason i did not use it to start with is because i thought it might slow it down being its more work, especially as the table has no index.
Peter
you are going to scan the entire table no matter what, but you don't need to run update every row. the WHERE will give you the largest boost in performance if you have many rows that do not have the quote character in them. if every row has the quote character, then this will slow you down.
KM