I am working on an application that will need to read a large number of records (close to 500,000) from one table and insert them into another set of tables in the same database. I thought about using an SSIS package for this, but our DBAs don't want to use that. Now I am thinking of a multi-threaded approach. I am thinking that I can have a few threads started that will read, say, 500 records at a time, insert them, then come back and read more.

Now, say I spawn off 3 threads of this application. The first thread reads 500 rows and starts processing them. Can I lock the rows that were already read so that the next thread does not pick them up? I have been trying to find articles about this on the internet, but perhaps I am not searching Google for the right terms.

Any ideas, or links to articles that might be helpful?

A: 

If you have to insert thousands of records into SQL Server, take a look at bulk inserts; your select query shouldn't pose many problems. But this might all be overkill: if this is a one-time operation, copying 500,000 records shouldn't take long.
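For instance, a minimal SqlBulkCopy sketch in C# (the table names Source and Destination and the already-open SqlConnection conn are placeholders, not from the answer):

using System.Data.SqlClient;

static void CopyAll(SqlConnection conn)
{
    using (var cmd = new SqlCommand("SELECT * FROM Source", conn))
    using (var reader = cmd.ExecuteReader())
    //a second connection for the writer, since the first is busy streaming
    using (var bulkCopy = new SqlBulkCopy(conn.ConnectionString))
    {
        bulkCopy.DestinationTableName = "Destination";
        bulkCopy.BatchSize = 5000;      //send rows to the server in batches of 5,000
        bulkCopy.WriteToServer(reader); //streams the rows straight across
    }
}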

Carra
A: 

You could stick all of the record IDs into a queue and have each of your threads pull an ID from the queue, copy that record across, and repeat until the queue is empty. You'll need a thread-safe method to pull IDs from the queue, though.

Something like:

public class RecordCopier
{
    private readonly Queue<int> _queue;

    public RecordCopier(IEnumerable<int> recordIDs)
    {
        _queue = new Queue<int>(recordIDs);
    }

    public void InsertNextRecord()
    {
        while (true)
        {
            int recordID = this.PopRecordID();
            if (recordID == -1)
                return; //queue is empty: exit the thread
            //do whatever it is that you need to do to select the
            //record and re-insert it.
        }
    }

    public int PopRecordID()
    {
        lock (this._queue)
        {
            if (this._queue.Count == 0)
                return -1;
            return this._queue.Dequeue();
        }
    }
}

So create however many threads you want and have each of them execute the InsertNextRecord() method until the queue is empty.
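For example, a sketch of the startup code (recordIDs, the list of IDs to move, is a placeholder; requires System.Threading and System.Collections.Generic):

var copier = new RecordCopier(recordIDs);
var threads = new List<Thread>();
for (int i = 0; i < 3; i++) //three threads, as in the question
{
    var t = new Thread(copier.InsertNextRecord);
    t.Start();
    threads.Add(t);
}
foreach (var t in threads)
    t.Join(); //block until every worker has drained the queue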

MagicWishMonkey
+1  A: 

Personally, I would just use the bulk copy class (SqlBulkCopy). If I needed this to run in the background, I'd do it on a single additional thread rather than adding all that complexity. Multi-threading is hard enough to get right; unless it's truly necessary, I would limit it to one background thread rather than trying to manage a bunch of them and worrying about concurrency.
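Something in that shape, for instance (CopyRecords is a stand-in for whatever method wraps the SqlBulkCopy call; requires System.Threading):

//one extra thread, marked as background so it won't keep the process alive
var worker = new Thread(CopyRecords) { IsBackground = true };
worker.Start();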

David Stratton
+3  A: 

Do you really need an app to do this? The most efficient way would be to execute a single SQL statement on the server that transfers the data between the tables.

SqlBulkCopy should easily be fast enough with a single thread. For best performance, consider loading the data with a data reader and decorating it (decorator pattern) with a class that performs the required transformation. You then pass the decorated IDataReader to SqlBulkCopy to get a continuous stream of data between the tables, which keeps memory overhead low and completes the transfer in a matter of seconds.

Example: an input table A with a single column of type float, and an output table B with a single column of type float. We will read every number from table A and insert the square root of each non-negative number into table B.

class SqrtingDataDecorator : IDataReader
{
    private readonly IDataReader _decorated;
    private double _input;

    public SqrtingDataDecorator(IDataReader decorated)
    {
        _decorated = decorated;
    }

    public bool Read()
    {
        //advance the underlying reader, skipping negative values
        while (_decorated.Read())
        {
            _input = _decorated.GetDouble(0);
            if (_input >= 0)
                return true;
        }
        return false;
    }

    public object GetValue(int index)
    {
        //transform the current value on the way out
        return Math.Sqrt(_input);
    }

    public int FieldCount { get { return 1; } }

    //other IDataReader members just throw NotSupportedExceptions,
    //return null or do nothing. Omitted for clarity.
}

Here is the bit that does the work:

//get the input datareader ("select floatCol from A", or whatever)
IDataReader dr = selectCommand.ExecuteReader();
using (SqlTransaction tx = _connection.BeginTransaction())
{
    try
    {
        using (SqlBulkCopy sqlBulkCopy =
            new SqlBulkCopy(_connection, SqlBulkCopyOptions.Default, tx))
        {
            sqlBulkCopy.DestinationTableName = "B";
            SetColumnMappings(sqlBulkCopy.ColumnMappings);
            //above method omitted for clarity, easy to figure out

            //now wrap the input datareader in the decorator
            var sqrter = new SqrtingDataDecorator(dr);
            //the following line does the data transfer
            sqlBulkCopy.WriteToServer(sqrter);
            tx.Commit();
        }
    }
    catch
    {
        tx.Rollback();
        throw;
    }
}
Matt Howells
This app will need to make a decision about every record it reads before inserting, so it will need to loop over every record. I don't think bulk insert will work in my case.
My suggestion does allow you to examine every row that is read and apply whatever logic you wish to transform it into an output row. It will be a little more complicated if you want to insert into multiple tables, but the principle is sound.
Matt Howells
That is a great strategy. However, I am somewhat of a n00b when it comes to design patterns. Would it be possible for you to provide an example, especially one with the continuous stream of data? Thanks.
Nice use of decorator.
RichardOD
Awesome Matt ... Thanks ...
@Matt, you need to implement a little bit more of the data reader; see http://github.com/SamSaffron/So-Slow/blob/21328cb3b7f94776f0f57b450a2adc79fe6e0584/MinimalDataReader.cs for the full list. Feel free to add a link.
Sam Saffron
@Sam, amended comment about other members. Writing this code from memory! Cheers.
Matt Howells
A: 

Is there any way you can avoid doing this with round trips between an app and the DB? Can this all be done within DB code, in a stored proc or set of stored procs?

Joe
A: 

What makes you think multi-threading will make it faster? The bottleneck is probably the disk on your SQL Server, and multi-threading will make disk throughput lower, not higher: SQL Server will have to interleave requests from 3 threads to the disk.

If you have to make it multi-threaded, you can divide the work by row ID. For example, the first thread does rows 1-333, then 1000-1333, then 2000-2333, and so on.
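A sketch of that scheme (the thread count, block size, row count, and the per-range query are all placeholders):

int threadCount = 3, blockSize = 333, maxId = 500000;
for (int i = 0; i < threadCount; i++)
{
    int index = i; //capture a copy for the closure
    new Thread(() =>
    {
        //thread 0 takes 1-333, 1000-1332, ...; thread 1 takes 334-666, ...
        for (int start = index * blockSize + 1; start <= maxId;
             start += threadCount * blockSize)
        {
            int end = Math.Min(start + blockSize - 1, maxId);
            //SELECT the rows with id BETWEEN start AND end, apply the
            //per-record logic, and insert into the target tables
        }
    }).Start();
}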

Andomar
Well, what I meant was that the app could process the records faster, since it would be running the per-record logic on several threads. I like your suggestion about partitioning the rows, but how would the individual threads know what to read? I am thinking the initial count of unprocessed records would have to be captured by the parent thread, which would then tell each child thread which rows to access. Is that a proper method?
When starting a thread, you can pass parameters to its startup method. Use the parameters to specify the work for that specific thread.
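For instance (ProcessRows is a hypothetical per-range worker, and the bounds are placeholders):

var worker = new Thread(state =>
{
    var range = (int[])state;        //{ firstId, lastId }
    ProcessRows(range[0], range[1]); //hypothetical per-range worker
});
worker.Start(new[] { 1, 333 });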
Andomar