
I have a web application that uses a fairly large table (millions of rows, about 30 columns). Let's call that TableA. Among the 30 columns, this table has a primary key named "id", and another column named "campaignID".

As part of the application, users are able to upload new sets of data pertaining to new "campaigns".

These data sets have the same structure as TableA, but typically only about 10,000-20,000 rows.

Every row in a new data set will have a unique "id", but they'll all share the same campaignID. In other words, the user is loading the complete data for a new "campaign", so all 10,000 rows have the same "campaignID".

Usually, users are uploading data for a NEW campaign, so there are no rows in TableA with the same campaignID. Since the "id" is unique to each campaign, the id of every row of new data will be unique in TableA.

However, in the rare case where a user tries to load a new set of rows for a "campaign" that's already in the database, the requirement was to remove all the old rows for that campaign from TableA first, and then insert the new rows from the new data set.

So, my stored procedure was simple:

  1. BULK INSERT the new data into a temporary table (#TableB)
  2. Delete any existing rows in TableA with the same campaignID
  3. INSERT INTO TableA ([columns]) SELECT [columns] FROM #TableB
  4. Drop #TableB

This worked just fine.
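For reference, here's a minimal sketch of that procedure. TableA really has about 30 columns; the three-column shape (id, campaignID, someValue), the file path, and the format options below are hypothetical placeholders.

CREATE TABLE #TableB (
    id         INT NOT NULL PRIMARY KEY,
    campaignID INT NOT NULL,
    someValue  VARCHAR(50) NULL
);

-- 1. Load the uploaded file into the temp table
BULK INSERT #TableB
FROM 'C:\uploads\new_campaign.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');

-- 2. Remove any existing rows for this campaign
--    (every row in #TableB shares the same campaignID)
DELETE FROM TableA
WHERE campaignID IN (SELECT DISTINCT campaignID FROM #TableB);

-- 3. Copy the new rows in
INSERT INTO TableA (id, campaignID, someValue)
SELECT id, campaignID, someValue FROM #TableB;

-- 4. Clean up
DROP TABLE #TableB;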

But the new requirement is to give users 3 options for handling "duplicates" when they upload new data - instances where the user is uploading data for a campaign that's already in TableA.

  1. Remove ALL data in TableA with the same campaignID, then insert all the new data from #TableB. (This is the old behavior. With this option, there will never be duplicates.)
  2. If a row in #TableB has the same id as a row in TableA, then update that row in TableA with the row from #TableB. (Effectively, this is "replacing" the old data with the new data.)
  3. If a row in #TableB has the same id as a row in TableA, then ignore that row in #TableB (Essentially, this is preserving the original data, and ignoring the new data).

A user doesn't get to choose this on a row-by-row basis. She chooses how the data will be merged, and this logic is applied to the entire data set.

In a similar application I worked on that used MySQL, I used the "LOAD DATA INFILE" statement with its "REPLACE" or "IGNORE" option. But I don't know how to do this with SQL Server/T-SQL.

Any solution needs to be efficient enough to handle the fact that TableA has millions of rows, and #TableB (the new data set) may have 10k-20k rows.

I googled for something like a "MERGE" command (which seems to be supported in SQL Server 2008), but I only have access to SQL Server 2005.

In rough pseudocode, I need something like this:

If user selects option 1: [I'm all set here - I have this working]

If user selects option 2 (replace):

merge into TableA as Target
using #TableB as Source
 on Target.id = Source.id
when matched then 
 update row in TableA with row from #TableB
when not matched then
 insert row from #TableB into TableA

If user selects option 3 (preserve):

merge into TableA as Target
using #TableB as Source
 on Target.id = Source.id
when matched then 
 do nothing
when not matched then
 insert row from #TableB into TableA
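
(For comparison, the real SQL Server 2008 MERGE syntax for option 2 would look roughly like the sketch below, where someValue stands in for the other columns; for option 3 you'd simply leave out the "when matched" clause. But again, I'm stuck on 2005.)

merge into TableA as Target
using #TableB as Source
 on Target.id = Source.id
when matched then
 update set Target.campaignID = Source.campaignID,
            Target.someValue = Source.someValue
when not matched then
 insert (id, campaignID, someValue)
 values (Source.id, Source.campaignID, Source.someValue);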
A: 

Getting the new data in is fairly easy; you can do it in a set with a left join.

The update should also be fairly straightforward, since you can do it in a set as well.

The tricky part is getting the transactions right - it's complicated enough getting it right for one row.
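
A rough sketch of that approach for option 2, assuming someValue stands in for the remaining columns:

begin tran;

-- update the rows that already exist, in one set
update a
set a.campaignID = b.campaignID,
    a.someValue = b.someValue
from TableA a
join #TableB b on a.id = b.id;

-- insert the rows that don't, via a left join
insert into TableA (id, campaignID, someValue)
select b.id, b.campaignID, b.someValue
from #TableB b
left join TableA a on a.id = b.id
where a.id is null;

commit tran;

For option 3 (preserve), you'd run only the left-join insert and skip the update.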

Sam Saffron
+2  A: 

How about this?

option 2:

begin tran;
-- replace: remove the old versions of any rows that match, then load everything
delete from TableA where exists (select 1 from #TableB where TableA.id = #TableB.id);
insert into TableA select * from #TableB;
commit tran;

option 3:

begin tran;
-- preserve: remove the incoming duplicates from the upload instead, then load what's left
delete from #TableB where exists (select 1 from TableA where TableA.id = #TableB.id);
insert into TableA select * from #TableB;
commit tran;

As for performance, so long as the id column in TableA (the big table) is indexed, you should be fine.

Chris
Chris, this solution works perfectly. However, I'm way too much of a SQL novice to be able to evaluate whether it's efficient. For example, in this statement - "delete from TableA where exists (select 1 from #TableB where TableA.id = #TableB.id)" - does that sub-select statement get executed once for every row in TableA? If TableA has 5m rows, does that mean this statement executes 5m sub-selects? In any case, many thanks for a solution that solves the problem!
mattstuehler
In my experience, the 'where exists' SQL will be quicker than 'where id in (select id from #TableB)', which is your other option. And the latter 'in' SQL looks like it does one query, which you'd think would be quick, right? Anyway, I'd just give it a try and see how long it takes.
Chris
This solution is fairly expensive, as it may cause a bunch of indexes to rebuild even if no data changes. Under concurrent usage you will probably get deadlocks and primary key violations.
Sam Saffron
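
One possible mitigation for the concurrency problem - a sketch only, using SQL Server 2005's standard updlock/holdlock table hints, and not tested under load - is to take and hold key-range locks for the whole transaction, so a second upload touching the same ids blocks instead of deadlocking:

begin tran;

-- updlock + holdlock keep key-range locks until commit, so a concurrent
-- upload of the same ids waits here instead of racing the insert below
-- into a primary key violation
delete from TableA with (updlock, holdlock)
where exists (select 1 from #TableB where TableA.id = #TableB.id);

insert into TableA select * from #TableB;

commit tran;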