views: 1577
answers: 12
I had something like this in my code (.NET 2.0, MS SQL):

SqlConnection connection = new SqlConnection(
  @"Data Source=localhost;Initial Catalog=DataBase;Integrated Security=True");
  connection.Open();

  SqlCommand cmdInsert = connection.CreateCommand();
  SqlTransaction sqlTran = connection.BeginTransaction();
  cmdInsert.Transaction = sqlTran;

  cmdInsert.CommandText =
     @"INSERT INTO MyDestinationTable" +
      "(Year, Month, Day, Hour,  ...) " +
      "VALUES " +
      "(@Year, @Month, @Day, @Hour, ...) ";

  cmdInsert.Parameters.Add("@Year", SqlDbType.SmallInt);
  cmdInsert.Parameters.Add("@Month", SqlDbType.TinyInt);
  cmdInsert.Parameters.Add("@Day", SqlDbType.TinyInt);
  // more fields here
  cmdInsert.Prepare();

  Stream stream = new FileStream(fileName, FileMode.Open, FileAccess.Read);

  StreamReader reader = new StreamReader(stream);
  char[] delimiter = new char[] {' '};
  String[] records;
  while (!reader.EndOfStream)
  {
    records = reader.ReadLine().Split(delimiter, StringSplitOptions.None);

    cmdInsert.Parameters["@Year"].Value = Int32.Parse(records[0].Substring(0, 4));
    cmdInsert.Parameters["@Month"].Value = Int32.Parse(records[0].Substring(5, 2));
    cmdInsert.Parameters["@Day"].Value = Int32.Parse(records[0].Substring(8, 2));
    // more complicated stuff here
    cmdInsert.ExecuteNonQuery();
  }
  sqlTran.Commit();
  connection.Close();

With cmdInsert.ExecuteNonQuery() commented out, this code executes in less than 2 seconds. With the SQL execution it takes 1 min 20 sec. There are around 0.5 million records. The table is emptied beforehand. An SSIS data flow task with similar functionality takes around 20 seconds.

  • Bulk Insert was not an option (see below). I did some fancy stuff during this import.
  • My test machine is Core 2 Duo with 2 GB RAM.
  • When looking at Task Manager, the CPU was not fully utilized. IO also seemed not to be fully utilized.
  • The schema is simple as hell: one table with an auto-increment int as the primary index and fewer than 10 ints, tinyints and chars(10).

After some answers here I found that it is possible to execute a bulk copy from memory! I had been refusing to use bulk copy because I thought it had to be done from a file...

Now I use this and it takes around 20 seconds (like the SSIS task):

  DataTable dataTable = new DataTable();

  dataTable.Columns.Add(new DataColumn("ixMyIndex", System.Type.GetType("System.Int32")));   
  dataTable.Columns.Add(new DataColumn("Year", System.Type.GetType("System.Int32")));   
  dataTable.Columns.Add(new DataColumn("Month", System.Type.GetType("System.Int32")));
  dataTable.Columns.Add(new DataColumn("Day", System.Type.GetType("System.Int32")));
 // ... and more to go

  DataRow dataRow;
  object[] objectRow = new object[dataTable.Columns.Count];

  Stream stream = new FileStream(fileName, FileMode.Open, FileAccess.Read);

  StreamReader reader = new StreamReader(stream);
  char[] delimiter = new char[] { ' ' };
  String[] records;
  int recordCount = 0;
  while (!reader.EndOfStream)
  {
    records = reader.ReadLine().Split(delimiter, StringSplitOptions.None);

    dataRow = dataTable.NewRow();
    objectRow[0] = null; 
    objectRow[1] = Int32.Parse(records[0].Substring(0, 4));
    objectRow[2] = Int32.Parse(records[0].Substring(5, 2));
    objectRow[3] = Int32.Parse(records[0].Substring(8, 2));
    // my fancy stuff goes here

    dataRow.ItemArray = objectRow;         
    dataTable.Rows.Add(dataRow);

    recordCount++;
  }

  SqlBulkCopy bulkTask = new SqlBulkCopy(connection, SqlBulkCopyOptions.TableLock, null);
  bulkTask.DestinationTableName = "MyDestinationTable"; 
  bulkTask.BatchSize = dataTable.Rows.Count;
  bulkTask.WriteToServer(dataTable);
  bulkTask.Close();
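
One note on the snippet above: with no column mappings set, SqlBulkCopy matches DataTable columns to table columns by ordinal position, so the DataTable layout has to mirror the destination table exactly. A minimal sketch of name-based mappings instead (these lines would go before the WriteToServer call; only the columns shown above are listed):

  // Sketch: map DataTable columns to destination columns by name so the
  // bulk copy no longer depends on column order. Add these before
  // bulkTask.WriteToServer(dataTable).
  bulkTask.ColumnMappings.Add("Year", "Year");
  bulkTask.ColumnMappings.Add("Month", "Month");
  bulkTask.ColumnMappings.Add("Day", "Day");
  // ... one mapping per remaining column ...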
+3  A: 

Is the transaction required? Using a transaction needs many more resources than simple commands.

Also, if you are sure that the inserted values are correct, you can use a bulk insert.

TcKs
I use one transaction for all operations, so IMHO it does not matter. Am I wrong?
Michal Sznajder
Yes, you are wrong. Transactions should be as short as they can be, because managing transactions (and the data in them) is a little bit expensive.
TcKs
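
For illustration, a minimal sketch of the shorter-transactions suggestion: commit every N rows inside the original loop. The 10,000 batch size and the rowsInBatch counter are illustrative, not from the question:

  // Sketch: commit in small transactions instead of one big one.
  // Uses the same connection and cmdInsert as in the question.
  const int batchSize = 10000;   // illustrative value
  int rowsInBatch = 0;

  SqlTransaction tran = connection.BeginTransaction();
  cmdInsert.Transaction = tran;

  while (!reader.EndOfStream)
  {
    // ... set the parameter values as in the question ...
    cmdInsert.ExecuteNonQuery();

    if (++rowsInBatch == batchSize)
    {
      tran.Commit();                        // finish the current short transaction
      tran = connection.BeginTransaction(); // and start a new one
      cmdInsert.Transaction = tran;
      rowsInBatch = 0;
    }
  }
  tran.Commit(); // commit the final partial batch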
+2  A: 

1 minute sounds pretty reasonable for 0.5 million records. That's a record every 0.00012 seconds.

Does the table have any indexes? Removing these and reapplying them after the bulk insert would improve performance of the inserts, if that is an option.
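
For illustration, a minimal sketch of dropping and recreating a secondary index around the load, should the table ever gain one (the index name IX_MyDestinationTable_Year is hypothetical; the question's table reports only a primary key):

  // Sketch: drop a (hypothetical) nonclustered index before the load and
  // recreate it afterwards, so it is built once instead of maintained per row.
  SqlCommand ddl = connection.CreateCommand();
  ddl.CommandText = "DROP INDEX IX_MyDestinationTable_Year ON MyDestinationTable";
  ddl.ExecuteNonQuery();

  // ... perform the 500,000 inserts here ...

  ddl.CommandText = "CREATE INDEX IX_MyDestinationTable_Year ON MyDestinationTable (Year)";
  ddl.ExecuteNonQuery();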

Ian Nelson
One primary index on auto int field
Michal Sznajder
+1  A: 

It doesn't seem unreasonable to me to process 8,333 records per second...what kind of throughput are you expecting?

JustinD
+1  A: 

If you need better speed, you might consider implementing bulk insert:

http://msdn.microsoft.com/en-us/library/ms188365.aspx
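
For reference, a minimal sketch of issuing BULK INSERT as a plain T-SQL command from the same connection (the file path and terminators are assumptions; the file must be readable by the SQL Server process, and no per-row C# transformations are possible with this approach):

  // Sketch: server-side BULK INSERT driven from C#.
  SqlCommand bulkCmd = connection.CreateCommand();
  bulkCmd.CommandText =
      @"BULK INSERT MyDestinationTable
        FROM 'C:\import\data.txt'
        WITH (FIELDTERMINATOR = ' ', ROWTERMINATOR = '\n', TABLOCK)";
  bulkCmd.ExecuteNonQuery();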

stephenbayer
A: 

I assume that what is taking approximately 58 seconds is the physical inserting of 500,000 records - so you are getting around 10,000 inserts a second. Without knowing the specs of your database server machine (I see you are using localhost, so network delays shouldn't be an issue), it is hard to say if this is good, bad, or abysmal.

I would look at your database schema - are there a bunch of indexes on the table that have to be updated after each insert? These could come from other tables with foreign keys referencing the table you are working on. There are SQL profiling tools and performance monitoring facilities built into SQL Server, but I've never used them; they may, however, reveal problems like locks and the like.

Ken Ray
A: 

Do the fancy stuff on the data for all records first, then bulk-insert them.

(Since you're not doing selects after an insert, I don't see a problem with applying all operations on the data before the bulk insert.)

sirrocco
A: 

If I had to guess, the first thing I would look for is too many indexes, or the wrong kind of indexes, on the tbTrafficLogTTL table. Without looking at the schema definition for the table, I can't really say, but I have experienced similar performance problems when:

  1. The primary key is a GUID and the primary index is CLUSTERED.
  2. There's some sort of UNIQUE index on a set of fields.
  3. There are too many indexes on the table.

When you start indexing half a million rows of data, the time spent to create and maintain indexes adds up.

I will also note that if you have any option to convert the Year, Month, Day, Hour, Minute, Second fields into a single datetime2 or timestamp field, you should. You're adding a lot of complexity to your data architecture, for no gain. The only reason I would even contemplate using a split-field structure like that is if you're dealing with a pre-existing database schema that cannot be changed for any reason. In which case, it sucks to be you.
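
For illustration, a minimal sketch of what the loading side could look like with one combined column instead of the split fields (the LogTime column name is hypothetical; year/month/day are parsed exactly as in the question, and hour/minute/second would come from the rest of the record):

  // Sketch: build a single DateTime instead of separate Year/Month/Day columns.
  dataTable.Columns.Add(new DataColumn("LogTime", typeof(DateTime)));

  // inside the read loop:
  int year  = Int32.Parse(records[0].Substring(0, 4));
  int month = Int32.Parse(records[0].Substring(5, 2));
  int day   = Int32.Parse(records[0].Substring(8, 2));
  objectRow[1] = new DateTime(year, month, day); // plus hour/minute/second as needed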

Craig Trader
A: 

I had a similar problem in my last contract. You're making 500,000 trips to SQL to insert your data. For a dramatic increase in performance, you want to investigate the BulkInsert method in the SQL namespace. I had "reload" processes that went from 2+ hours to restore a couple of dozen tables down to 31 seconds once I implemented Bulk Import.

David
A: 

This could best be accomplished using something like the bcp command. If that isn't available, the suggestions above about using BULK INSERT are your best bet. You're making 500,000 round trips to the database and writing 500,000 entries to the log files, not to mention any space that needs to be allocated to the log file, the table, and the indexes.

If you're inserting in an order that is different from your clustered index, you also have to deal with the time required to reorganize the physical data on disk. There are a lot of variables here that could possibly be making your query run slower than you would like it to.

~10,000 transactions per second isn't terrible for individual inserts round-tripping from code.

Jeremiah Peschka
+7  A: 

Instead of inserting each record individually, try using the SqlBulkCopy class to bulk insert all the records at once.

Create a DataTable and add all your records to the DataTable, and then use SqlBulkCopy.WriteToServer to bulk insert all the data at once.

Adam Hughes
+1  A: 

If some form of bulk insert isn't an option, the other way would be multiple threads, each with their own connection to the database.

The issue with the current system is that you have 500,000 round trips to the database, and are waiting for the first round trip to complete before starting the next - any sort of latency (ie, a network between the machines) will mean that most of your time is spent waiting.

If you can split the job up, perhaps using some form of producer/consumer setup, you might find that you can get much more utilisation of all the resources.

However, to do this you will have to lose the one big transaction - otherwise the first writer thread will block all the others until its transaction is completed. You can still use transactions, but you'll have to use a lot of small ones rather than one large one.

SSIS will be fast because it uses the bulk-insert method - do all the complicated processing first, generate the final list of data to insert, and hand it all to the bulk insert at once.
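
For illustration, a minimal sketch of the multi-connection idea under stated assumptions: the file is parsed up front, the rows are split into chunks, and each chunk is inserted on its own thread with its own connection and its own small transaction (connectionString, allParsedRecords, SplitIntoChunks and InsertChunk are hypothetical names, not from the question):

  // Sketch: one worker thread per chunk, each with its own connection and
  // its own small transaction (.NET 2.0 style: plain threads, no TPL).
  List<string[]>[] chunks = SplitIntoChunks(allParsedRecords, 4); // hypothetical helper, 4 workers
  List<Thread> workers = new List<Thread>();

  foreach (List<string[]> chunk in chunks)
  {
    List<string[]> localChunk = chunk; // copy for safe capture in the delegate
    Thread worker = new Thread(delegate()
    {
      using (SqlConnection conn = new SqlConnection(connectionString)) // assumed connection string
      {
        conn.Open();
        using (SqlTransaction tran = conn.BeginTransaction())
        {
          InsertChunk(conn, tran, localChunk); // hypothetical: runs the prepared INSERT per row
          tran.Commit();
        }
      }
    });
    worker.Start();
    workers.Add(worker);
  }

  foreach (Thread worker in workers)
    worker.Join(); // wait for all workers to finish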

Jim T
A: 

BULK INSERT = bcp, from a permissions point of view.

You could batch the INSERTs to reduce round trips: SqlDataAdapter.UpdateBatchSize = 10000 gives 50 round trips (see the sketch below).

You still have 500k inserts though...
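
A minimal sketch of the batched-adapter idea mentioned above, assuming the DataTable built in the question (the column list is abbreviated to the fields shown there):

  // Sketch: push the INSERTs through a SqlDataAdapter so ADO.NET groups
  // them into fewer round trips.
  SqlCommand insert = connection.CreateCommand();
  // (only three of the columns are shown; the real command lists them all)
  insert.CommandText =
      "INSERT INTO MyDestinationTable (Year, Month, Day) VALUES (@Year, @Month, @Day)";
  insert.Parameters.Add("@Year", SqlDbType.SmallInt, 0, "Year");
  insert.Parameters.Add("@Month", SqlDbType.TinyInt, 0, "Month");
  insert.Parameters.Add("@Day", SqlDbType.TinyInt, 0, "Day");
  insert.UpdatedRowSource = UpdateRowSource.None; // required for batching

  SqlDataAdapter adapter = new SqlDataAdapter();
  adapter.InsertCommand = insert;
  adapter.UpdateBatchSize = 10000; // ~50 round trips for 500k rows

  adapter.Update(dataTable); // sends the newly added rows in batches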

Article

MSDN

gbn