First off, let me say I am running on SQL Server 2005, so I don't have access to MERGE.

I have a table with ~150k rows that I am updating daily from a text file. As rows fall out of the text file, I need to delete them from the database, and if they change or are new, I need to update/insert accordingly.

After some testing I've found that, performance-wise, it is dramatically faster to do a full delete and then bulk insert from the text file rather than read through the file line by line doing an update/insert. However, I recently came across some posts discussing how to mimic the MERGE functionality of SQL Server 2008 using a temp table and the OUTPUT clause of the UPDATE statement.

I was interested in this because I am looking into how I can eliminate the window in my delete/bulk insert method when the table has no rows. I still think this method will be the fastest, so I am looking for the best way to solve the empty-table problem.

Thanks

+4  A: 

I think your fastest method would be to:

  1. Drop all foreign keys and indexes from your table.
  2. Truncate your table.
  3. Bulk insert your data.
  4. Recreate your foreign keys and indexes.
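
A minimal sketch of those four steps, with placeholder object and file names (ChildTable, FK_ChildTable_MyTable, IX_MyTable_SomeColumn, and C:\datafile.txt are illustrative, not from the question):

-- 1. Drop foreign keys that reference the table, and its indexes (placeholder names)
ALTER TABLE dbo.ChildTable DROP CONSTRAINT FK_ChildTable_MyTable;
DROP INDEX IX_MyTable_SomeColumn ON dbo.MyTable;

-- 2. Truncate (requires that no foreign keys reference the table)
TRUNCATE TABLE dbo.MyTable;

-- 3. Bulk insert the daily text file
BULK INSERT dbo.MyTable
FROM 'C:\datafile.txt'
WITH (FIELDTERMINATOR = '\t', ROWTERMINATOR = '\n', TABLOCK);

-- 4. Recreate the indexes and foreign keys
CREATE INDEX IX_MyTable_SomeColumn ON dbo.MyTable (SomeColumn);
ALTER TABLE dbo.ChildTable
    ADD CONSTRAINT FK_ChildTable_MyTable
    FOREIGN KEY (MyTableID) REFERENCES dbo.MyTable (MyTableID);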
Joe Stefanelli
Thanks for the tips. I didn't know about TRUNCATE and will likely use it, but I am trying to eliminate the short period between the deletion and the bulk insert when the table is empty. Any ideas?
rpf3
@rpf3: Give the TRUNCATE a try. I think it will eliminate much of the delay you're talking about.
Joe Stefanelli
The TRUNCATE was definitely faster than the delete, but it still takes ~9 seconds for the bulk insert. I've been asked to see if there is a way to eliminate even this small amount of downtime, because other processes might hit the database while the load is running.
rpf3
@rpf3: If you've followed the steps I've given, then I'm not aware of anything else to speed this up. Honestly, 9 seconds to bulk insert 150K rows once a day doesn't sound unreasonable to me.
Joe Stefanelli
Oh, I completely agree with you; in fact, it's a massive improvement over how it was being done previously, row by row. However, due to the nature of the data in the table, if a query were run during those 9 seconds and returned no rows, it could potentially be very bad. This is a fringe case, but with automated processes running all day that have the potential to hit the DB, it may happen. Is there a way to do the bulk insert into a temp table and then swap the two, or something?
rpf3
@rpf3: You could bulk insert into a temp table and try something like [sp_rename](http://msdn.microsoft.com/en-us/library/ms188351.aspx), though I think that would require a table lock that would be just as harmful. You could also try creating two versions of the table and a view that alternates between them while the other one is being bulk-inserted. Ultimately, though, my gut reaction is that it may just be safer and easier to code some retry/exception-handling logic into those automated processes.
Joe Stefanelli
A: 

For raw speed, with ~150K rows in the table, I think I'd just drop the table, recreate it from scratch (without indexes) and then bulk load afresh. Once the bulk load is done, create the indexes.

This assumes, of course, that having a period of time when the table is empty/doesn't exist is acceptable, which it sounds like could be the case.

AdaTheDev
A: 

Is the problem that Joe's solution is not fast enough, or that you cannot have any activity against the target table while your process runs? If you just need to prevent users from running queries against the target table, you should wrap your process in a transaction. That way, when your TRUNCATE TABLE executes, it will take a table lock that will be held for the duration of the transaction, like so:

begin tran;

truncate table stage_table;

bulk insert stage_table
from N'C:\datafile.txt';

commit tran;
Sake God
I was thinking about doing this, but if you don't have permission to access either the data file or the format file, an error gets thrown that cannot be caught by SQL TRY/CATCH and will stop the code mid-transaction, leaving it open.
rpf3
+1  A: 

An alternative solution that would satisfy your requirement of not having "down time" for the table you are updating.

It sounds like you were originally reading the file and doing an INSERT/UPDATE/DELETE one row at a time. A more performant approach, which does not involve clearing down the table, is as follows:

1) Bulk load the file into a new, separate table (no indexes).
2) Then create the PK on it.
3) Run 3 statements to update the original table from this new (temporary) table:
   - DELETE rows in the main table that don't exist in the new table
   - UPDATE rows in the main table where there is a matching row in the new table
   - INSERT rows into the main table from the new table where they don't already exist

This will perform better than row-by-row operations and should hopefully satisfy your overall requirements.
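
A rough sketch of those three statements, assuming a main table dbo.MyTable and a staging table dbo.MyTable_Staging that share a key column ID and a data column Col1 (all names here are illustrative):

-- 1) Bulk load into the separate staging table (no indexes on it yet)
BULK INSERT dbo.MyTable_Staging
FROM 'C:\datafile.txt';

-- 2) Create the PK on the staging table
ALTER TABLE dbo.MyTable_Staging
    ADD CONSTRAINT PK_MyTable_Staging PRIMARY KEY (ID);

-- 3) DELETE rows that are no longer in the file
DELETE m
FROM dbo.MyTable AS m
WHERE NOT EXISTS (SELECT 1 FROM dbo.MyTable_Staging AS s WHERE s.ID = m.ID);

-- UPDATE rows that have a matching row in the staging table
UPDATE m
SET m.Col1 = s.Col1
FROM dbo.MyTable AS m
JOIN dbo.MyTable_Staging AS s ON s.ID = m.ID;

-- INSERT rows that don't already exist in the main table
INSERT INTO dbo.MyTable (ID, Col1)
SELECT s.ID, s.Col1
FROM dbo.MyTable_Staging AS s
WHERE NOT EXISTS (SELECT 1 FROM dbo.MyTable AS m WHERE m.ID = s.ID);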

AdaTheDev
Thanks, I'm gonna run some tests to see if I want to use this or just keep the bulk insert inside a locked transaction for the short run time.
rpf3
+1  A: 

There is a way to update the table with zero downtime: keep two days' data in the table, and delete the old rows after loading the new ones!

  1. Add a DataDate column representing the date for which your ~150K rows are valid.
  2. Create a one-row, one-column table with "today's" DataDate.
  3. Create a view of the two tables that selects only rows matching the row in the DataDate table. Index it if you like. Readers will now refer to this view, not the table.
  4. Bulk insert the rows. (You'll obviously need to add the DataDate to each row.)
  5. Update the DataDate table. The view updates instantly!
  6. Delete yesterday's rows at your leisure.

SELECT performance won't suffer; joining one row to 150,000 rows along the primary key should present no problem to any server less than 15 years old.
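
A hedged sketch of that setup, assuming the main table is dbo.MyTable with columns ID, Col1 and the added DataDate (all object names here are illustrative; SQL Server 2005 has no DATE type, so datetime is used):

-- One-row pointer table holding "today's" DataDate
CREATE TABLE dbo.CurrentDataDate (DataDate datetime NOT NULL);
GO

-- Readers query this view instead of the table directly
CREATE VIEW dbo.MyTableCurrent
AS
SELECT t.ID, t.Col1, t.DataDate
FROM dbo.MyTable AS t
JOIN dbo.CurrentDataDate AS d ON d.DataDate = t.DataDate;
GO

-- Daily load: bulk insert the new rows (each tagged with the new DataDate),
-- then flip the pointer; readers of the view switch over instantly
DECLARE @NewDataDate datetime;
SET @NewDataDate = '20101019';   -- placeholder: the date the freshly loaded file represents

UPDATE dbo.CurrentDataDate SET DataDate = @NewDataDate;

-- Delete yesterday's rows whenever convenient
DELETE FROM dbo.MyTable WHERE DataDate < @NewDataDate;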

I have used this technique often, and have also struggled with processes that relied on sp_rename. Production processes that modify the schema are a headache. Don't.

James K. Lowden