I have a merge statement that needs to compare on many columns. The source table has 26,000 rows. The destination table has several million rows. The destination table only has a typical Primary Key index on an int-type column.

I did some selects with group by to count the number of unique values in the source.

The matching part of the MERGE is:

Merge Into desttable
Using #temptable
On
(
    desttable.ColumnA = #temptable.ColumnA
    and desttable.ColumnB = #temptable.ColumnB
    and desttable.ColumnC = #temptable.ColumnC
    and desttable.ColumnD = #temptable.ColumnD
    and desttable.ColumnE = #temptable.ColumnE
    and desttable.ColumnF = #temptable.ColumnF
)
When Not Matched Then Insert Values (.......)

-- ColumnA: 167 unique values in #temptable
-- ColumnB: 1 unique value in #temptable
-- ColumnC: 13 unique values in #temptable
-- ColumnD: 89 unique values in #temptable
-- ColumnE: 550 unique values in #temptable
-- ColumnF: 487 unique values in #temptable

-- ColumnA: 3690 unique values in desttable
-- ColumnB: 3 unique values (plus null is possible) in desttable
-- ColumnC: 1113 unique values in desttable
-- ColumnD: 2662 unique values in desttable
-- ColumnE: 1770 unique values in desttable
-- ColumnF: 1480 unique values in desttable

The merge right now takes a very, very long time. I think I need to change my primary key but am not sure what the best tactic might be. 26,000 rows can be inserted on the first merge, but subsequent merges might only have ~2,000 inserts to do. Since I have no indexes and only a simple PK, everything is slow. :)

Can anyone point out how to make this better?

Thanks!

+2  A: 

Well, an obvious candidate would be an index on the columns you use to do your matching in the MERGE statement - do you have an index on (ColumnA, ColumnB, ColumnC, ColumnD, ColumnE, ColumnF) on your destination table?

This tuple of columns is being used to determine whether or not a row from your source table already exists in the destination. If you don't have that index or any other usable index in place, you basically get a table scan on the large destination table for each row in your source table.

If not, I would try adding it and then see how the runtime behavior changes. Does the MERGE now take a little less than a very, very long time?
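
For illustration, such an index might look like the sketch below - the index name and the column order are assumptions (ordered by the distinct counts given in the question, most selective first, which is a common rule of thumb), not something prescribed by the question:

-- sketch only: name and column order are assumptions
CREATE NONCLUSTERED INDEX IX_desttable_MergeKey
ON desttable (ColumnA, ColumnD, ColumnE, ColumnF, ColumnC, ColumnB);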

marc_s
Experimenting a little more, I added a varchar(42) column 'HashCol' to desttable and made a unique index on it, then added the column to the temp table with this:

Update #temptable
Set HashCol = (SELECT SUBSTRING(master.dbo.fn_varbintohexstr(HASHBYTES('Sha1', Cast([ColumnA] as VarChar(16)) + ColumnB + ColumnC + ColumnD + ColumnE + ColumnF)), 3, 32))

Then in the proc I indexed the HashCol and merge on that. I haven't tried a composite index yet because I am not sure what the optimal order is for the columns; IIRC the most unique columns should be first.
ScSub
I have decided to stick with adding the HashCol because I found an edge case where ColumnB can be null, which raised an error that was easy to see in the script. I also just found out that the columns can change, and it's easier to change 'what matters' with the HashCol than it is to drop and recreate the composite index. Thanks all for the suggestions!
ScSub
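
(A NULL-safe variant of that hash update might look like the sketch below; the ISNULL wrapping is only one assumed way of handling the ColumnB-can-be-null edge case, and it assumes the columns are character types:)

-- sketch: wrap nullable columns so a NULL doesn't null out the whole hash input
Update #temptable
Set HashCol = SUBSTRING(master.dbo.fn_varbintohexstr(
    HASHBYTES('Sha1',
        Cast([ColumnA] as VarChar(16))
        + ISNULL(ColumnB, '')
        + ISNULL(ColumnC, '')
        + ISNULL(ColumnD, '')
        + ISNULL(ColumnE, '')
        + ISNULL(ColumnF, ''))), 3, 32)
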
+1  A: 

My suggestion: if you only need to run it once, the MERGE statement is acceptable, assuming time is not that critical. But if you're going to use the script more often, I think it's better to do it step by step instead of using the MERGE statement - that is, write your own select, insert, update, and delete statements to reach the same goal. With this you'll have more control over almost everything (query optimization, indexing, etc.).

In your case, separating the 6 join criteria might be more efficient than combining them all at once; the downside is you'll have a longer script.
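
For example, the insert-only step might look like the sketch below (a sketch only - the column list is abbreviated, and the exact statements would depend on the real schema):

Insert Into desttable (ColumnA, ColumnB, ColumnC, ColumnD, ColumnE, ColumnF /* , ... */)
Select t.ColumnA, t.ColumnB, t.ColumnC, t.ColumnD, t.ColumnE, t.ColumnF /* , ... */
From #temptable t
Where Not Exists
(
    -- only insert rows that have no match on all six columns
    Select 1
    From desttable d
    Where d.ColumnA = t.ColumnA
    and d.ColumnB = t.ColumnB
    and d.ColumnC = t.ColumnC
    and d.ColumnD = t.ColumnD
    and d.ColumnE = t.ColumnE
    and d.ColumnF = t.ColumnF
)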

koderoid
Why roll your own? I highly doubt you'll manage to be faster than the MERGE statement. Plus, the MERGE statement is one transactional unit - if you roll your own, you have to do a lot of the basic bookkeeping and administration yourself.
marc_s
Exactly - it'll be faster to break down one transactional unit with a huge set of data into multiple transactional units with small sets of data. Like in his sample, if he's going to process 26,000 or a million rows, would it be better to have one criterion per set:

SET_A = where Table1.ColA = Table2.ColA
SET_B = SET_A where Table1.ColB = Table2.ColB
...

than one huge set with all the criteria:

Result = where Table1.ColA = Table2.ColA and Table1.ColB = Table2.ColB and ...

Regardless of indexing at this point. Regarding administration and bookkeeping? It's just one script, I don't see any issue with that.
koderoid
@koderoid - Those 6 columns are the joining criteria - I can't see how you propose they can be split up.
Martin Smith
@TempTABLE1 = from (physical table/s) where Table1.ColA = Table2.ColA
@TempTABLE1 = from @TempTABLE1 where Table1.ColB = Table2.ColB
and so on... It can also be @TempTABLE1, @TempTABLE2...
koderoid