views:

1363

answers:

6

Hi All,

I'm just beginning to learn the concept of threading, and I'm stuck on this one problem; it's driving me crazy....

What I actually need to accomplish -

I have some 300 text files in a local directory that need to be parsed for specific values. After I find these "values" in each text file, I need to store them in a database. So I followed the plain approach of accessing each text file in the directory, parsing it, and adding the resulting values as a row to a local DataTable. When I'm done parsing all the files and storing 300 rows in the DataTable, I do a SqlBulkCopy of the DataTable to my database. This approach works fine, except that it takes about 10 minutes to run!

What I'm attempting to do now -

Create a new thread for each file, keeping the thread count below 4 at any given time... then each thread would parse its file and return a row to add to the local DataTable.

Where I'm stuck - I don't understand how to update this single DataTable that gets rows from multiple threads...

Quite an explanation, isn't it? I hope someone here can suggest a good approach...

Thanks, Nidhi

+1  A: 

This will be much easier if you just let each of your four threads write to the database itself. In that scenario you don't have to worry about threading (except for deciding which files each thread works on), since each worker thread maintains its own DataTable and consumes 25% of the files.
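A minimal sketch of that approach (the file names, `ParseFile` helper, and column name are all made up): each thread takes every fourth file and fills its own DataTable, so nothing needs locking until you bulk-copy the four tables one after another at the end.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Threading;

class PartitionedParsing
{
    // Hypothetical stand-in for the real parsing logic.
    static string ParseFile(string path) => "value from " + path;

    public static int Run()
    {
        var files = new List<string>();
        for (int i = 0; i < 300; i++) files.Add("file" + i + ".txt");

        const int threadCount = 4;
        var tables = new DataTable[threadCount];
        var threads = new Thread[threadCount];

        for (int t = 0; t < threadCount; t++)
        {
            int id = t; // capture the loop variable for the closure
            tables[id] = new DataTable();
            tables[id].Columns.Add("Value", typeof(string));

            threads[id] = new Thread(() =>
            {
                // Each thread takes every fourth file and writes only to its
                // own table, so no locking is needed anywhere.
                for (int i = id; i < files.Count; i += threadCount)
                    tables[id].Rows.Add(ParseFile(files[i]));
            });
            threads[id].Start();
        }

        foreach (var thread in threads) thread.Join();

        // At this point each table could be sent with its own SqlBulkCopy,
        // or merged and sent once.
        int total = 0;
        foreach (var table in tables) total += table.Rows.Count;
        return total;
    }

    static void Main() => Console.WriteLine(Run()); // prints 300
}
```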

Alternatively, you can have a single DataTable that all the threads share; just make sure to wrap accesses to it in a lock, like so:

lock(YourTable.Rows.SyncRoot){
  // add rows to table
}


Of course this is all moot if the bottleneck is the disk, as @David B notes.

Michael Haren
A: 

Why would multiple threads make your limited resource (disk access) more abundant?

David B
A: 

SQLBulkCopy is a big hammer for only 300 rows.

Check out Smart Thread Pool. It is an instance thread pool that you can easily limit to 4 threads. Since you only have 300 rows, consider posting them directly to SQL from each thread rather than aggregating them in your code.
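For example (a sketch; `InsertRow` stands in for a real parameterized `SqlCommand` insert, and the file names are invented): four threads each take a quarter of the files and post their rows straight to the database, so no shared DataTable is needed at all.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

class DirectInserts
{
    static int _inserted;

    // Stand-in for a real parameterized insert, roughly:
    //   using (var cmd = new SqlCommand(
    //       "INSERT INTO Results(Value) VALUES (@v)", connection))
    //   {
    //       cmd.Parameters.AddWithValue("@v", value);
    //       cmd.ExecuteNonQuery();
    //   }
    static void InsertRow(string value) => Interlocked.Increment(ref _inserted);

    public static int Run()
    {
        _inserted = 0;
        var files = new List<string>();
        for (int i = 0; i < 300; i++) files.Add("file" + i + ".txt");

        var threads = new Thread[4];
        for (int t = 0; t < threads.Length; t++)
        {
            int id = t; // capture for the closure
            threads[id] = new Thread(() =>
            {
                // Each thread would open its own (pooled) connection here and
                // insert every fourth file's row itself: no shared state.
                for (int i = id; i < files.Count; i += threads.Length)
                    InsertRow("parsed " + files[i]);
            });
            threads[id].Start();
        }

        foreach (var thread in threads) thread.Join();
        return _inserted;
    }

    static void Main() => Console.WriteLine(Run()); // prints 300
}
```

Thanks to connection pooling, opening a connection inside each thread is cheap; the pool hands back an already-open physical connection.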

Gary
Agreed. 300 rows is nothing. With connection pooling and a light query, it'd be no big deal to hit the database 300 times, one row at a time.
Michael Haren
Thanks for the prompt replies, everyone. So I get that "SQLBulkCopy is a big hammer for only 300 rows." In what scenarios should I use SqlBulkCopy in the future: when I have 1,000 rows, 10,000, or how many (assuming we have 10 columns)? I'll follow the advice of having each thread insert its row into the database at the end... that solves my problem.
Nidhi
But I still don't know how to update a DataTable in the case where all the threads report to a single DataTable. lock(YourTable.Rows.SyncRoot){ /* add rows to table */ } or lock(myTable){ myTable.Rows.Add(...); } doesn't really help me much. Can anyone link me to a good code sample?
Nidhi
A: 

As the others have pointed out, remember to lock your table before updating. C#:

private readonly object tableLock = new object(); // must be initialized: lock(null) throws

/*
Later in code.
*/

private void UpdateDataTable(object data)
{
    lock(tableLock)
    {
          //Add or update table rows
    }
}

As for actually controlling and keeping the threads in line, just use the ThreadPool, set the maximum thread count to your limit, and let the queuing take care of things. For additional control you can toss in some logic that uses an array of WaitHandle objects; in fact that might be a good idea, considering that you want to queue up 300 separate work items.
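One caveat on the WaitHandle-array idea: WaitHandle.WaitAll tops out at 64 handles, so 300 events won't fit in one call. A common workaround, sketched below with the work-item body standing in for the real parse-and-insert, is to count completions down and signal a single event:

```csharp
using System;
using System.Threading;

class PooledWork
{
    public static int Run()
    {
        const int jobs = 300;
        int pending = jobs;
        int processed = 0;
        var allDone = new ManualResetEvent(false);

        // Rather than one WaitHandle per file (WaitAll caps at 64 handles),
        // decrement a shared counter and set one event when it hits zero.
        for (int i = 0; i < jobs; i++)
        {
            ThreadPool.QueueUserWorkItem(_ =>
            {
                Interlocked.Increment(ref processed); // stand-in for parse + insert
                if (Interlocked.Decrement(ref pending) == 0)
                    allDone.Set();
            });
        }

        allDone.WaitOne(); // block until every queued item has run
        return processed;
    }

    static void Main() => Console.WriteLine(Run()); // prints 300
}
```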

Zensar
A: 

What made you think that more threads would improve things? They probably won't.

I suggest you first get the program to work, then worry about getting it to work faster. Do it with only one thread.

John Saunders
+1  A: 

As was somewhat pointed out, you need to examine exactly where your bottleneck is and why you're using threading.

By moving to multiple threads, you do have a potential for increased performance. However, if you're updating the same DataTable with each thread, you're limited by the DataTable. Only one thread can write to the DataTable at one time (which you control with a lock), so you're still fundamentally processing in sequence.

On the other hand, most databases are designed for multiple connections, running on multiple threads, and have been highly tuned for that purpose. If you want to still use multiple threads: let each thread have its own connection to the database, and do its own processing.

Now, depending on the kind of processing going on, your bottleneck may be in opening and processing the file, and not in the database update.

One way to split things up:

  1. Put all the file names to be processed into a filename Queue.
  2. Create a thread (or threads) to pull an item off the filename Queue, open and parse and process the file, and push the results into a result Queue.
  3. Have another thread take the results from the result Queue, and insert them into the database.

These can run simultaneously... the database won't be updated until there's something to update, and will just wait in the meantime.

This approach lets you really know who is waiting on whom. If the read/process file part is slow, create more threads to do that. If the insert into database part is slow, create more threads to do that. The queues just need to be synchronized.

So, pseudocode:

static Queue<string> _filesToProcess = new Queue<string>();
static Queue<string> _results = new Queue<string>();
static Thread _fileProcessingThread = new Thread( ProcessFiles );
static Thread _databaseUpdatingThread = new Thread( UpdateDatabase );
static volatile bool _finished = false;

static void Main()
{
    foreach( string fileName in GetFileNamesToProcess() )
    {
       _filesToProcess.Enqueue( fileName );
    }

    _fileProcessingThread.Start();
    _databaseUpdatingThread.Start();

    // if we want to wait until they're both finished
    _fileProcessingThread.Join();
    _databaseUpdatingThread.Join();

    Console.WriteLine( "Done" );
}

static void ProcessFiles()
{
   while( true )
   {
      string fileToProcess;

      // check and dequeue under a single lock, so this stays correct
      // even if several processing threads share the queue
      lock( _filesToProcess )
      {
         if( _filesToProcess.Count == 0 ) break; // nothing left to do
         fileToProcess = _filesToProcess.Dequeue();
      }

      string resultAsString = ProcessFileAndGetResult( fileToProcess );

      lock( _results ){ _results.Enqueue( resultAsString ); }
   }

   _finished = true;
}

static void UpdateDatabase()
{
   while( true )
   {
      string resultAsString = null;

      // check and dequeue under a single lock, as above
      lock( _results )
      {
         if( _results.Count > 0 ) resultAsString = _results.Dequeue();
      }

      if( resultAsString != null )
      {
         InsertIntoDatabase( resultAsString ); // implement this however
      }
      else if( _finished )
      {
         break; // the producer is done and the queue is drained
      }
      else
      {
         Thread.Sleep( 1 ); // idle wait; prevents the CPU from spinning at 100%
      }
   }
}

I'm pretty sure there are ways to make that "better", but it should do the trick: you can read and process data while also adding completed data to the database, taking advantage of threading.
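One of those "better" ways, as a sketch (the class and names here are made up): replace the Sleep(1) polling with Monitor.Wait/Monitor.Pulse, so the database thread sleeps until the file-processing thread actually hands it something.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// A small blocking queue: consumers sleep inside Monitor.Wait instead of
// polling with Thread.Sleep(1), and producers wake them with Monitor.Pulse.
class BlockingQueue<T>
{
    private readonly Queue<T> _items = new Queue<T>();
    private bool _closed;

    public void Enqueue(T item)
    {
        lock (_items)
        {
            _items.Enqueue(item);
            Monitor.Pulse(_items); // wake one waiting consumer
        }
    }

    public void Close()
    {
        lock (_items)
        {
            _closed = true;
            Monitor.PulseAll(_items); // wake everyone so they can exit
        }
    }

    // Returns false only when the queue is closed and fully drained.
    public bool TryDequeue(out T item)
    {
        lock (_items)
        {
            while (_items.Count == 0 && !_closed)
                Monitor.Wait(_items); // releases the lock while sleeping

            if (_items.Count > 0) { item = _items.Dequeue(); return true; }
            item = default(T);
            return false;
        }
    }
}

class PipelineDemo
{
    public static int Run()
    {
        var results = new BlockingQueue<string>();
        int inserted = 0;

        var consumer = new Thread(() =>
        {
            while (results.TryDequeue(out var row))
                Interlocked.Increment(ref inserted); // stand-in for the DB insert
        });
        consumer.Start();

        for (int i = 0; i < 300; i++) results.Enqueue("row " + i);
        results.Close();
        consumer.Join();
        return inserted;
    }

    static void Main() => Console.WriteLine(Run()); // prints 300
}
```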

If you want another thread to process files, or to update the database, just create a new Thread( MethodName ) and call Start().

It's not the simplest example, but I think it's thorough: you're synchronizing two queues and making sure each is locked before it's accessed; you're keeping track of when each thread should finish; and data is marshaled between threads, but never processed more than once, via the Queues.

Hope that helps.

Jonathan Mitchem