I am trying to insert a huge amount of data into SQL Server. My destination table has a unique index called "Hash".

I would like to replace my SqlDataAdapter implementation with SqlBulkCopy. In SqlDataAdapter there is a property called "ContinueUpdateOnError"; when it is set to true, adapter.Update(table) will insert every row it can and tag the failed rows with the RowError property.
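
For reference, here is a minimal sketch of that SqlDataAdapter pattern (the table name, column name and connection string are placeholders, not my real code):

    // Sketch only: assumes a destination table "DestinationTable" with a unique index on "Hash".
    using System;
    using System.Data;
    using System.Data.SqlClient;

    static void UpdateWithRowErrors(DataTable table, string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var adapter = new SqlDataAdapter("SELECT * FROM DestinationTable", connection))
        using (var builder = new SqlCommandBuilder(adapter))   // generates the insert command
        {
            adapter.ContinueUpdateOnError = true;   // do not throw on the first failed row
            adapter.Update(table);                  // inserts every row it can

            foreach (DataRow row in table.Rows)
            {
                if (row.HasErrors)
                {
                    // RowError carries the error text, e.g. the unique index violation on Hash
                    Console.WriteLine("Failed: {0} - {1}", row["Hash"], row.RowError);
                }
            }
        }
    }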

The question is how can I use SqlBulkCopy to insert data as quickly as possible while keeping track of which rows got inserted and which rows did not (due to the unique index)?

Thanks in advance!

Here is the additional information:

  1. The process is iterative, often set on a schedule to repeat.

  2. The source and destination tables can be huge, sometimes millions of rows.

  3. Even though it is possible to check for the hash values first, that requires two round trips per row (first selecting the hash from the destination table, then performing the insertion). I think that in the adapter.Update(table) case it is faster to check the RowError than to check for hash hits row by row.

+1  A: 

SqlBulkCopy has very limited error-handling facilities; by default it doesn't even check constraints.

However, it's fast, really really fast.

If you want to work around the duplicate-key issue and identify which rows in a batch are duplicates, one option is:

  • start tran
  • Grab a TABLOCKX on the table, select all current "Hash" values and chuck them in a HashSet.
  • Filter out the duplicates and report.
  • Insert the data
  • commit tran

This process will work effectively if you are inserting huge sets and the amount of data already in the table is not too large.
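
A rough sketch of that flow, assuming string hashes and made-up names ("DestinationTable", "Hash"), not a drop-in implementation:

    using System.Collections.Generic;
    using System.Data;
    using System.Data.SqlClient;

    static DataTable BulkInsertNewRows(DataTable source, string connectionString)
    {
        var duplicates = source.Clone();   // rejected rows get reported from here

        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            using (var tran = connection.BeginTransaction())
            {
                // TABLOCKX keeps other writers out while we snapshot the existing hashes.
                var existing = new HashSet<string>();
                using (var cmd = new SqlCommand(
                    "SELECT Hash FROM DestinationTable WITH (TABLOCKX)", connection, tran))
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                        existing.Add(reader.GetString(0));
                }

                // Filter out rows whose hash already exists (or repeats within the batch).
                var toInsert = source.Clone();
                foreach (DataRow row in source.Rows)
                {
                    if (existing.Add((string)row["Hash"]))
                        toInsert.ImportRow(row);
                    else
                        duplicates.ImportRow(row);
                }

                using (var bulk = new SqlBulkCopy(connection, SqlBulkCopyOptions.Default, tran))
                {
                    bulk.DestinationTableName = "DestinationTable";
                    bulk.WriteToServer(toInsert);
                }

                tran.Commit();
            }
        }

        return duplicates;   // caller can log / report these
    }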

Can you please expand your question to include the rest of the context of the problem?

EDIT

Now that I have some more context here is another way you can go about it:

  • Do the bulk insert into a temp table.
  • start serializable tran
  • Select all temp rows that are already in the destination table ... report on them
  • Insert the data in the temp table into the real table, performing a left join on hash and including all the new rows.
  • commit the tran

That process is very light on round trips, and considering your specs it should end up being really fast.
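
A sketch of that variant, again with made-up names ("DestinationTable", "Hash") and with the SQL meant as illustration only:

    using System.Data;
    using System.Data.SqlClient;

    static void BulkInsertViaTempTable(DataTable source, string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();

            // Temp table mirroring the destination schema (no indexes, so nothing to violate).
            using (var create = new SqlCommand(
                "SELECT TOP 0 * INTO #Staging FROM DestinationTable", connection))
            {
                create.ExecuteNonQuery();
            }

            // Bulk insert into the temp table.
            using (var bulk = new SqlBulkCopy(connection))
            {
                bulk.DestinationTableName = "#Staging";
                bulk.WriteToServer(source);
            }

            // Serializable transaction: report duplicates, then insert only the new rows.
            using (var tran = connection.BeginTransaction(IsolationLevel.Serializable))
            {
                using (var report = new SqlCommand(
                    @"SELECT s.* FROM #Staging s
                      INNER JOIN DestinationTable d ON d.Hash = s.Hash", connection, tran))
                using (var reader = report.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        // log / report the duplicate row, e.g. reader["Hash"]
                    }
                }

                using (var insert = new SqlCommand(
                    @"INSERT INTO DestinationTable
                      SELECT s.* FROM #Staging s
                      LEFT JOIN DestinationTable d ON d.Hash = s.Hash
                      WHERE d.Hash IS NULL", connection, tran))
                {
                    insert.ExecuteNonQuery();
                }

                tran.Commit();
            }
        }
    }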

Sam Saffron
Sam, thank you for your answer. I have updated my question to include more information. I cannot, and probably don't want to, load all the destination hash values into a HashSet, but I could check for them before the insertion. That requires considerably more round trips to the database, though. Is that the only way?
Paladin
See inline edit
Sam Saffron
Thanks again, Sam. Could you clarify "Insert the data in the temp table into the real table, performing a left join on hash and including all the new rows"? How can I perform a left join on hash and include ONLY the new rows?
Paladin
@Paladin, post a separate question on that
Sam Saffron
I got it; it was probably a temporary short circuit on my part. I will recap your answers below, due to the limited comment space.
Paladin
A: 

Note: This is a recap of Sam's answer with slightly more details

Thanks to Sam for the answer. I have put it in an answer due to the comments' space constraints.

Building on your answer, I see two possible approaches:

Solution 1:

  • start tran
  • grab all possible hash hits by running "select hash from destinationTable where hash in (val1, val2, ...)" (a sketch of this check follows after the list)
  • filter out duplicates and report
  • insert data
  • commit tran
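
A hedged sketch of the duplicate check in solution 1: only the hashes of the incoming batch are sent to the server, as a parameterized IN (...) list. Names are made up, and very large batches would need to be chunked because SQL Server limits the number of parameters per command.

    using System.Collections.Generic;
    using System.Data.SqlClient;
    using System.Linq;

    static HashSet<string> GetExistingHashes(IList<string> batchHashes,
                                             SqlConnection connection,
                                             SqlTransaction tran)
    {
        // Build "@h0, @h1, ..." so the hash values are passed as parameters, not literals.
        var names = batchHashes.Select((h, i) => "@h" + i).ToArray();
        var sql = "SELECT Hash FROM DestinationTable WHERE Hash IN ("
                  + string.Join(", ", names) + ")";

        var hits = new HashSet<string>();
        using (var cmd = new SqlCommand(sql, connection, tran))
        {
            for (int i = 0; i < batchHashes.Count; i++)
                cmd.Parameters.AddWithValue(names[i], batchHashes[i]);

            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    hits.Add(reader.GetString(0));
            }
        }
        return hits;   // filter these out of the batch before the bulk insert
    }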

Solution 2:

  • Create temp table to mirror the schema of destination table
  • bulk insert into the temp table
  • start serializable transaction
  • Get duplicate rows: "select tempTable.hash from tempTable inner join destinationTable on tempTable.hash = destinationTable.hash"
  • report on duplicate rows
  • Insert the data in the temp table into the destination table: "insert into destinationTable select tempTable.* from tempTable left join destinationTable on tempTable.hash = destinationTable.hash where destinationTable.hash is null"
  • commit the tran

Since we have two approaches, it comes down to which one is more efficient. Both approaches have to retrieve the duplicate rows and report on them, while the second approach additionally requires:

  • temp table creation and deletion
  • one more sql command to move data from temp to destination table
  • depending on the percentage of hash collisions, transferring a lot of unnecessary data across the wire

If these are the only solutions, it seems to me that the first approach wins. What do you guys think? Thanks!

Paladin
Paladin, you are misunderstanding the second approach. It does not involve pulling a full hash list to the client; instead it performs the duplicate analysis on the server. That makes it scale much better once the pre-existing table already has a ton of data.
Sam Saffron
Sam, are you talking about the "Get duplicate rows" step? That step doesn't pull a full hash list to the client, just the matching/colliding hashes, which I need at the client anyway. It does exactly "Select all temp rows that are already in the destination table ... report on them". Could you elaborate on the duplicate analysis process? Thanks!
Paladin