Hi,

I need to upload a massive (16GB, 65+ million records) CSV file to a single table in a SQL Server 2005 database. Does anyone have any pointers on the best way to do this?

Details

I am currently using a C# console application (.NET Framework 2.0) to split the import file into files of 50,000 records, and then process each file. I upload the records into the database from the console application using the SqlBulkCopy class in batches of 5,000. Splitting the file takes approximately 30 minutes, and uploading the entire data set (65+ million records) takes approximately 4.5 hours. The generated file size and the batch upload size are both configuration settings, and I am investigating increasing both values to improve performance. The application runs on a quad-core server with 16GB RAM, which is also the database server.
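
For illustration only (this is not the actual application code), a minimal sketch of the splitting step might look like the following; the paths, chunk naming and the absence of header handling are assumptions:

    // Sketch: split a large CSV into chunk files of a fixed number of lines.
    // Paths and naming are placeholders; any header row is simply copied into
    // the first chunk, which a real implementation would need to handle.
    using System.IO;

    static class CsvSplitter
    {
        public static void Split(string inputPath, string outputDir, int linesPerFile)
        {
            int fileIndex = 0;
            int lineCount = 0;
            StreamWriter writer = null;

            try
            {
                using (StreamReader reader = new StreamReader(inputPath))
                {
                    string line;
                    while ((line = reader.ReadLine()) != null)
                    {
                        // Start a new chunk file when needed.
                        if (writer == null || lineCount == linesPerFile)
                        {
                            if (writer != null) writer.Close();
                            string chunkPath = Path.Combine(outputDir, "chunk_" + fileIndex + ".csv");
                            writer = new StreamWriter(chunkPath);
                            fileIndex++;
                            lineCount = 0;
                        }
                        writer.WriteLine(line);
                        lineCount++;
                    }
                }
            }
            finally
            {
                if (writer != null) writer.Close();
            }
        }
    }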

Update

Given the answers so far, please note that prior to the import:

  • The database table is truncated, and all indexes and constraints are dropped.
  • The database is shrunk, and disk space reclaimed.

After the import has completed:

  • The indexes are recreated.

If you can suggest any different approaches, or ways we can improve the existing import application, I would appreciate it. Thanks.

Related Question

The following question may be of use to others dealing with this problem:

Solution

I have investigated the effect of altering the batch size and the size of the split files, and found that batches of 500 records and split files of 200,000 records work best for my application. Using the SqlBulkCopyOptions.TableLock option also helped. See the answer to this question for further details.
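
For reference, a minimal sketch of a SqlBulkCopy call using these settings; the connection string, destination table and data source are placeholders rather than the application's actual values:

    // Sketch: bulk copy with the settings found to work best above
    // (BatchSize = 500, SqlBulkCopyOptions.TableLock). Table name and the
    // IDataReader source are illustrative placeholders.
    using System.Data;
    using System.Data.SqlClient;

    static class BulkLoader
    {
        public static void Load(IDataReader source, string connectionString)
        {
            using (SqlBulkCopy bulkCopy = new SqlBulkCopy(
                connectionString, SqlBulkCopyOptions.TableLock))
            {
                bulkCopy.DestinationTableName = "dbo.ImportTable"; // placeholder
                bulkCopy.BatchSize = 500;     // rows sent per batch
                bulkCopy.BulkCopyTimeout = 0; // 0 = no timeout for a long-running load
                bulkCopy.WriteToServer(source);
            }
        }
    }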

I also looked at using an SSIS package and a BULK INSERT SQL script. The SSIS package appeared quicker, but did not offer me the ability to record invalid records, etc. The BULK INSERT SQL script, whilst slower than the SSIS package, was considerably faster than the C# application. It did allow me to record errors, etc., and for this reason I am accepting the BULK INSERT answer from ConcernedOfTunbridgeWells as the solution. I'm aware that this may not be the best answer for everyone facing this issue, but it answers my immediate problem.

Thanks to everyone who replied.

Regards, MagicAndi

+2  A: 

The SqlBulkCopy class that you're already using is going to be your best bet. The best you can do from here in your C# code is to experiment with your particular system and data to see what batch sizes work best. But you're already doing that.

Going beyond the client code, there might be some things you can do with the server to make the import run more efficiently:

  • Try setting the table and database size before starting the import to something large enough to hold the entire set. You don't want to rely on auto-grow in the middle of this (a rough sketch follows this list).

  • Depending on how the data is sorted and any indexes on the table, you might do a little better to drop any indexes that don't match the order in which the records are imported, and then recreate them after the import.

  • Finally, it's tempting to try running this in parallel, with a few threads doing bulk inserts at one time. However, the biggest bottleneck is almost certainly disk performance. Anything you can do to the physical server to improve that (new disks, a SAN, etc.) will help much more.
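
As a rough illustration of the first bullet point, pre-growing the data file before the load might look something like this; the database name, logical file name and target size are placeholders, not known values:

    // Sketch: pre-grow the data file so auto-grow doesn't fire mid-import.
    // Database name, logical file name and size are illustrative assumptions;
    // the log file can be pre-grown the same way.
    using System.Data.SqlClient;

    static class PreGrow
    {
        public static void Run(string connectionString)
        {
            string sql = "ALTER DATABASE [ImportDb] " +
                         "MODIFY FILE (NAME = N'ImportDb_Data', SIZE = 40GB)";

            using (SqlConnection connection = new SqlConnection(connectionString))
            using (SqlCommand command = new SqlCommand(sql, connection))
            {
                command.CommandTimeout = 0; // growing a large file can take a while
                connection.Open();
                command.ExecuteNonQuery();
            }
        }
    }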

Joel Coehoorn
Joel, Thanks for your answer. See the updated question for some new information, responding to some of the points from your answer. However, I was a bit surprised to see that you recommended the use of parallel threads. The threads will be uploading against the same database table on the server. Won't the bulk upload operation from one thread lock the table and lead to the other threads waiting for it to release the table?
MagicAndi
I didn't mean to actually recommend parallel threads. I was trying to say it would be tempting, but not likely to actually get you anywhere because disk performance is more important.
Joel Coehoorn
Thanks for clarifying. +1.
MagicAndi
In regard to your update: see my first bullet point. You want to pre-grow the db to reserve space for the data. Repeated auto-grow in the middle of a large batch import is not fun.
Joel Coehoorn
Edited to clarify the multi-thread point.
Joel Coehoorn
Another idea: where I'm at we do a load like this every night, but the db is otherwise idle during that load so it's no big deal. If your db has to serve other requests while this runs, you might want to isolate this table to its own filegroup on its own drive/array.
Joel Coehoorn
A: 

Have you tried using the BULK INSERT command in SQL Server?

Kevin
+4  A: 

BULK INSERT is run from the DBMS itself, reading files described by a bcp format file from a directory on the server (or mounted on it). Write an application that splits the file into smaller chunks, places them in an appropriate directory, and then executes a wrapper that runs a series of BULK INSERT statements. You can run several threads in parallel if necessary.

This is probably about as fast as a bulk load gets. Also, if there's a suitable partitioning key available in the bulk load file, put the staging table on a partition scheme.

Also, if you're bulk loading into a table with a clustered index, make sure the data is sorted in the same order as the index. Merge sort is your friend for large data sets.
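
Not from the answer itself, but a rough sketch of the kind of wrapper described here, issuing one BULK INSERT per chunk file from C#; the staging table, directory, delimiters and the TABLOCK/ERRORFILE options are illustrative assumptions:

    // Sketch: run one BULK INSERT per chunk file. BULK INSERT executes on the
    // server, so the paths below are as seen by the SQL Server service.
    // Table name, directory and options are placeholders; ERRORFILE writes
    // rejected rows next to each chunk (the .err files must not already exist).
    using System.Data.SqlClient;
    using System.IO;

    static class BulkInsertRunner
    {
        public static void Run(string connectionString, string chunkDirectory)
        {
            using (SqlConnection connection = new SqlConnection(connectionString))
            {
                connection.Open();
                foreach (string chunkPath in Directory.GetFiles(chunkDirectory, "*.csv"))
                {
                    string sql =
                        "BULK INSERT dbo.StagingTable FROM '" + chunkPath + "' " +
                        "WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n', " +
                        "TABLOCK, ERRORFILE = '" + chunkPath + ".err')";

                    using (SqlCommand command = new SqlCommand(sql, connection))
                    {
                        command.CommandTimeout = 0; // each chunk may take minutes
                        command.ExecuteNonQuery();
                    }
                }
            }
        }
    }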

ConcernedOfTunbridgeWells
That's what the SqlBulkCopy class does.
Joel Coehoorn
Not strictly true. SqlBulkCopy wraps the OLEDB bulk load API which is still pushing the data over the client-server link. BULK INSERT runs in-process on the server.
ConcernedOfTunbridgeWells
ConcernedOfTunbridgeWells, Thanks for the answer, +1. I'll test your solution out and leave some feedback ASAP.
MagicAndi
Bulk insert is THE fastest way to load data into SQL Server. See http://msdn.microsoft.com/en-us/library/ms190421.aspx for tips on further optimization.
codeelegance
A: 

Lately, I had to upload/import a lot of stuff, too (built a PHP script).

I decided to process them record by record.

Of course it takes longer, but for me the following points were important:

  • being able to easily pause the process
  • better debugging

This is just a tip.

regards, Benedikt

Benedikt R.
Benedikt, using the C# application, I am still processing each record whilst reading the generated files. This allows me to validate each record, etc., prior to attempting the upload, should I wish to.
MagicAndi
+3  A: 

Have you tried SSIS (SQL Server Integration Services)?

Chris Brandsma
Chris, Thanks for your answer, I would be interested in looking further at using SSIS. Could you link to examples of using SSIS to upload data from a file? Thanks.
MagicAndi
SSIS has a native CSV reader (the Flat File Source). Point the reader at the correct output column types as well. SSIS is also supposed to batch the inserts. Unfortunately, teaching SSIS is a larger topic than this format allows -- and SSIS is rather graphical in nature.
Chris Brandsma
A: 

BULK INSERT is probably already the fastest way. You can gain additional performance by dropping indexes and constraints while inserting and reestablishing them later. The highest performance impact comes from clustered indexes.
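
For example (index and table names are placeholders), nonclustered indexes might be disabled before the load and rebuilt afterwards; a clustered index should not be disabled this way, since that makes the table inaccessible until it is rebuilt:

    // Sketch: helper for disabling/rebuilding a nonclustered index around the
    // load. Index and table names are illustrative placeholders.
    using System.Data.SqlClient;

    static class IndexMaintenance
    {
        public static void Execute(string connectionString, string sql)
        {
            using (SqlConnection connection = new SqlConnection(connectionString))
            using (SqlCommand command = new SqlCommand(sql, connection))
            {
                command.CommandTimeout = 0;
                connection.Open();
                command.ExecuteNonQuery();
            }
        }
    }

    // Before the import:
    //   IndexMaintenance.Execute(cs, "ALTER INDEX IX_ImportTable_Key ON dbo.ImportTable DISABLE");
    // After the import:
    //   IndexMaintenance.Execute(cs, "ALTER INDEX IX_ImportTable_Key ON dbo.ImportTable REBUILD");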

Daniel Brückner
A: 

Have you tried SQL Server Integration Services for this? It might be better able to handle such a large text file.

Conrad
A: 

Just to check: your inserts will be faster if there are no indexes on the table you are inserting into.

ck
This is only partly true. I've seen batch imports take _longer_ when trying this, because the import data already matched the index order.
Joel Coehoorn
+2  A: 

You may be able to save the step of splitting the files as follows:

  • Instantiate an IDataReader to read the values from the input CSV file. There are several ways to do this: the easiest is probably to use the Microsoft OleDb Jet driver. Google for this if you need more info - e.g. there's some info in this StackOverflow question.

    An alternative method is to use a technique like that used by www.csvreader.com.

  • Instantiate a SqlBulkCopy object, and set the BatchSize and BulkCopyTimeout properties to appropriate values.

  • Pass the IDataReader to the SqlBulkCopy.WriteToServer method.

I've used this technique successfully with large files, but not as large as yours.
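
A rough sketch of this approach, assuming the (32-bit) Jet text driver is available and the CSV has a header row; the connection strings, file names and destination table are illustrative:

    // Sketch: stream the CSV straight into SqlBulkCopy via the Jet text driver,
    // avoiding the file-splitting step. The Jet provider, HDR=Yes (header row
    // present) and the destination table are assumptions for illustration.
    using System.Data;
    using System.Data.OleDb;
    using System.Data.SqlClient;

    static class StreamingLoader
    {
        public static void Load(string csvDirectory, string csvFileName, string sqlConnectionString)
        {
            string oleDbConnectionString =
                "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + csvDirectory + ";" +
                "Extended Properties=\"text;HDR=Yes;FMT=Delimited\"";

            using (OleDbConnection source = new OleDbConnection(oleDbConnectionString))
            using (OleDbCommand select = new OleDbCommand(
                "SELECT * FROM [" + csvFileName + "]", source))
            {
                source.Open();
                using (IDataReader reader = select.ExecuteReader())
                using (SqlBulkCopy bulkCopy = new SqlBulkCopy(
                    sqlConnectionString, SqlBulkCopyOptions.TableLock))
                {
                    bulkCopy.DestinationTableName = "dbo.ImportTable"; // placeholder
                    bulkCopy.BatchSize = 500;
                    bulkCopy.BulkCopyTimeout = 0;
                    bulkCopy.WriteToServer(reader);
                }
            }
        }
    }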

Joe
+1  A: 

See this and this blog post for a comparison. It seems the best alternative is to use BULK INSERT with the TABLOCK option enabled.

santiiiii
Santiiiii, Thanks for the links, much appreciated. +1
MagicAndi
A: 

My scenario for things like this is: create an SSIS package on the SQL Server that uses BULK INSERT to load the data, and create a stored procedure inside the database that can run that package from T-SQL code.

After that, send the file for bulk insert to the SQL Server using FTP and call the SSIS package using the stored procedure.

adopilot