I have a web application that uses a fairly large table (millions of rows, about 30 columns). Let's call that TableA. Among the 30 columns, this table has a primary key named "id", and another column named "campaignID".
As part of the application, users are able to upload new sets of data pertaining to new "campaigns".
These data sets have the same structure as TableA, but typically only about 10,000-20,000 rows.
Every row in a new data set will have a unique "id", but they'll all share the same campaignID. In other words, the user is loading the complete data for a new "campaign", so all 10,000 rows have the same "campaignID".
Usually, users are uploading data for a NEW campaign, so there are no rows in TableA with the same campaignID. Since the "id" is unique to each campaign, the id of every row of new data will be unique in TableA.
However, in the rare case where a user tries to load a new set of rows for a "campaign" that's already in the database, the requirement was to remove all the old rows for that campaign from TableA first, and then insert the new rows from the new data set.
So, my stored procedure was simple:
- BULK INSERT the new data into a temporary table (#TableB)
- Delete any existing rows in TableA with the same campaignID
- INSERT INTO TableA ([columns]) SELECT [columns] FROM #TableB
- Drop #TableB
This worked just fine.
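For context, that procedure boils down to something like the sketch below. The file path and the col1/col2 columns are just placeholders, not the real names (the real table has about 30 columns):

```sql
-- Rough sketch of the current procedure (option 1 behavior).
-- File path and col1/col2 are placeholders.
CREATE TABLE #TableB (id INT PRIMARY KEY, campaignID INT, col1 VARCHAR(50), col2 VARCHAR(50));

BULK INSERT #TableB
FROM 'C:\uploads\new_campaign.csv'   -- placeholder path
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');

-- Remove any existing rows for the campaign(s) being loaded
DELETE FROM TableA
WHERE campaignID IN (SELECT DISTINCT campaignID FROM #TableB);

-- Insert the new data
INSERT INTO TableA (id, campaignID, col1, col2)
SELECT id, campaignID, col1, col2
FROM #TableB;

DROP TABLE #TableB;
```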
But the new requirement is to give users 3 options when they upload new data for handling "duplicates" - instances where the user is uploading data for a campaign that's already in TableA.
- Remove ALL data in TableA with the same campaignID, then insert all the new data from #TableB. (This is the old behavior. With this option, there will never be duplicates.)
- If a row in #TableB has the same id as a row in TableA, then update that row in TableA with the row from #TableB. (Effectively, this "replaces" the old data with the new data.)
- If a row in #TableB has the same id as a row in TableA, then ignore that row in #TableB. (Essentially, this preserves the original data and ignores the new data.)
A user doesn't get to choose this on a row-by-row basis. She chooses how the data will be merged, and this logic is applied to the entire data set.
In a similar application I worked on that used MySQL, I used the "LOAD DATA INFILE" statement with the "REPLACE" or "IGNORE" option, but I don't know how to do the equivalent with SQL Server/T-SQL.
Any solution needs to be efficient enough to handle the fact that TableA has millions of rows, and #TableB (the new data set) may have 10k-20k rows.
I googled and found the "MERGE" statement (which seems to be supported in SQL Server 2008), but I only have access to SQL Server 2005.
In rough pseudocode, I need something like this:
If user selects option 1: [I'm all set here - I have this working]
If user selects option 2 (replace):

```
merge into TableA as Target
using #TableB as Source
    on Target.id = Source.id
when matched then
    update the row in TableA with the row from #TableB
when not matched then
    insert the row from #TableB into TableA
```
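Since MERGE isn't available to me, I'm guessing option 2 would end up as an UPDATE joined to the temp table, followed by an INSERT of the rows that don't match. Something like the sketch below (col1/col2 are placeholders again), though I don't know if this is the right or most efficient way:

```sql
-- Option 2 (replace) without MERGE: sketch only; col1/col2 are placeholders

-- Update rows that already exist in TableA with the values from #TableB
UPDATE a
SET a.campaignID = b.campaignID,
    a.col1 = b.col1,
    a.col2 = b.col2
FROM TableA AS a
INNER JOIN #TableB AS b ON a.id = b.id;

-- Insert the rows from #TableB whose id isn't in TableA yet
INSERT INTO TableA (id, campaignID, col1, col2)
SELECT b.id, b.campaignID, b.col1, b.col2
FROM #TableB AS b
WHERE NOT EXISTS (SELECT 1 FROM TableA AS a WHERE a.id = b.id);
```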
If user selects option 3 (preserve):

```
merge into TableA as Target
using #TableB as Source
    on Target.id = Source.id
when matched then
    do nothing (keep the existing row in TableA)
when not matched then
    insert the row from #TableB into TableA
```
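For option 3, I imagine it's just the insert of rows whose id isn't already in TableA (again, this is just my guess at the approach, with placeholder columns):

```sql
-- Option 3 (preserve) without MERGE: only insert ids not already in TableA
INSERT INTO TableA (id, campaignID, col1, col2)
SELECT b.id, b.campaignID, b.col1, b.col2
FROM #TableB AS b
WHERE NOT EXISTS (SELECT 1 FROM TableA AS a WHERE a.id = b.id);
```

Is something along these lines the right approach in SQL Server 2005, and will it perform acceptably when TableA has millions of rows and #TableB has 10k-20k?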