Our application needs to process a series of files. Rather than doing this synchronously, we would like to use multi-threading to split the workload among several threads.

Each item of work is:
1. Open a file for read only
2. Process the data in the file
3. Write the processed data to a Dictionary

We would like to perform each file's work on a new thread. Is this possible, and would we be better off using the ThreadPool or spawning new threads, keeping in mind that each item of "work" takes only 30 ms but hundreds of files may need to be processed?

Any ideas to make this more efficient are appreciated.

EDIT: At the moment we are making use of the ThreadPool to handle this. If we have 500 files to process we cycle through the files and allocate each "unit of processing work" to the threadpool using QueueUserWorkItem.

Is it suitable to make use of the threadpool for this?

A: 

I suggest you have a fixed number of threads (say 4) and split the work into 4 equal batches. For example, if you have 400 files to process, give each thread 100 files. You then spawn the threads, pass each one its batch, and let them run until they have finished their own work.

You only have a certain amount of I/O bandwidth, so having too many threads will not provide any benefit. Also remember that creating a thread takes a small amount of time.
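The fixed-thread approach described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `ProcessData` is a hypothetical stand-in for the real processing step, the files are dealt out in an interleaved fashion rather than in contiguous blocks, and error handling is omitted.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;

public static class PartitionedFileProcessor
{
    // Hypothetical placeholder for the real per-file processing.
    static string ProcessData(string raw)
    {
        return raw.ToUpperInvariant();
    }

    public static Dictionary<string, string> ProcessFiles(IList<string> files, int threadCount)
    {
        Dictionary<string, string> results = new Dictionary<string, string>();
        object resultsLock = new object();
        List<Thread> threads = new List<Thread>();

        for (int t = 0; t < threadCount; t++)
        {
            int offset = t; // copy so each thread captures its own value
            Thread thread = new Thread(delegate()
            {
                // This thread handles files offset, offset + threadCount, ...
                for (int i = offset; i < files.Count; i += threadCount)
                {
                    string processed = ProcessData(File.ReadAllText(files[i]));
                    lock (resultsLock) // Dictionary is not safe for concurrent writes
                    {
                        results[files[i]] = processed;
                    }
                }
            });
            threads.Add(thread);
            thread.Start();
        }

        // Wait until every thread has finished its share of the files.
        foreach (Thread thread in threads)
        {
            thread.Join();
        }
        return results;
    }
}
```

Calling `PartitionedFileProcessor.ProcessFiles(files, 4)` would correspond to the "4 threads" example; the thread count is worth measuring rather than guessing.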

Chris
is this an application of a semaphore class?
washtik
No, it's just a threading model, but a concise one. The theory goes that having more threads than cores in your CPU is a waste anyway. I usually opt for 2 x core count. Also, your hard drive will probably be the biggest bottleneck, so having any more will reap no benefits. There is no need for a thread pool, as you have a static thread count with each thread doing preset work.
Chris
I can't see an issue with simply assigning all the tasks to the ThreadPool and letting it decide how to run them. Doesn't it do all the throttling and thread control under the hood?
washtik
A: 

Use ThreadPool.QueueUserWorkItem to execute each independent task. Definitely don't create hundreds of threads. That is likely to cause major headaches.

Marcelo Cantos
A: 

The general rule is to use the ThreadPool if you don't need to know when the threads finish (or want to avoid Mutexes to track them) and don't need to stop them.

So do you need to worry about when the work is done? If not, the ThreadPool is the best option. If you want to track overall progress or stop threads, then your own collection of threads is best.

ThreadPool is generally more efficient if you are re-using threads. This question will give you a more detailed discussion.
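If you do want to know when all the work is done while still using the ThreadPool, one possible sketch (compatible with .NET 2.0) is to count down outstanding items and signal an event when the count reaches zero. `ProcessData` is a hypothetical placeholder, and error handling is left out for brevity, so an exception in a work item is not handled here.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;

public static class PooledFileProcessor
{
    // Hypothetical placeholder for the real per-file processing.
    static string ProcessData(string raw)
    {
        return raw.Trim();
    }

    public static Dictionary<string, string> ProcessFiles(IList<string> files)
    {
        Dictionary<string, string> results = new Dictionary<string, string>();
        if (files.Count == 0) return results;

        object resultsLock = new object();
        int pending = files.Count;                 // work items not yet finished
        ManualResetEvent allDone = new ManualResetEvent(false);

        foreach (string file in files)
        {
            // The file path is passed as the state object of the work item.
            ThreadPool.QueueUserWorkItem(delegate(object state)
            {
                string path = (string)state;
                string processed = ProcessData(File.ReadAllText(path));
                lock (resultsLock) // Dictionary is not safe for concurrent writes
                {
                    results[path] = processed;
                }
                // The last item to finish signals the waiting caller.
                if (Interlocked.Decrement(ref pending) == 0)
                {
                    allDone.Set();
                }
            }, file);
        }

        allDone.WaitOne(); // block until every queued item has completed
        allDone.Close();
        return results;
    }
}
```

The pool decides how many items run at once; the caller only waits on the single event.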

Hth

Chris S
+1  A: 

Instead of having to deal with threads or manage thread pools directly I would suggest using a higher-level library like Parallel Extensions (PEX):

var filesContent = from file in enumerableOfFilesToProcess
                   select new 
                   {
                       File=file, 
                       Content=File.ReadAllText(file)
                   };

var processedContent = from content in filesContent
                       select new 
                       {
                           content.File, 
                           ProcessedContent = ProcessContent(content.Content)
                       };

var dictionary = processedContent
           .AsParallel()
           .ToDictionary(c => c.File);

PEX will handle thread management according to available cores and load while you get to concentrate on the business logic at hand (wow, that sounded like a commercial!)

PEX is part of the .NET Framework 4.0, but a back-port to 3.5 is also available as part of the Reactive Framework.
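As a variation, here is a minimal sketch assuming .NET 4 with PLINQ available. In this version `AsParallel()` is called on the source sequence, so the read and process steps that follow it are the ones that run in parallel; operators placed before `AsParallel()` are evaluated as part of enumerating the source. `ProcessContent` is a hypothetical stand-in for the real processing.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class PlinqFileProcessor
{
    // Hypothetical placeholder for the real per-file processing.
    static string ProcessContent(string content)
    {
        return content.ToUpperInvariant();
    }

    public static Dictionary<string, string> ProcessFiles(IEnumerable<string> files)
    {
        return files
            .AsParallel() // everything after this point is scheduled by PLINQ
            .Select(file => new
            {
                Name = file,
                Processed = ProcessContent(File.ReadAllText(file))
            })
            // PLINQ builds the dictionary itself, so no explicit locking is needed.
            .ToDictionary(x => x.Name, x => x.Processed);
    }
}
```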

Peter Lillevold
It did sound like a commercial. I was targeting 2.0 of the framework; PEX is for 3.5? Perhaps I should just move with the times and start developing for a higher version of the framework!
washtik
See my updated answer. And yeah, these are great days to move to the latest bits :)
Peter Lillevold
Think I misplaced `AsParallel`, it should convert the processedContent collection.
Peter Lillevold
+1  A: 

I suggest using the CCR (Concurrency and Coordination Runtime); it will handle the low-level threading details for you. As for your strategy, one thread per work item may not be the best approach depending on how you write to the dictionary: dictionaries aren't thread safe, so the synchronization you need can create heavy contention.

Here's some sample code using the CCR, an Interleave would work nicely here:

Arbiter.Activate(dispatcherQueue, Arbiter.Interleave(
    new TeardownReceiverGroup(Arbiter.Receive<bool>(
        false, mainPort, new Handler<bool>(Teardown))),
    new ExclusiveReceiverGroup(Arbiter.Receive<object>(
        true, mainPort, new Handler<object>(WriteData))),
    new ConcurrentReceiverGroup(Arbiter.Receive<string>(
        true, mainPort, new Handler<string>(ReadAndProcessData)))));

public void WriteData(object data)
{
    // write data to the dictionary
    // this code is never executed in parallel so no synchronization code needed
}

public void ReadAndProcessData(string s)
{
    // this code gets scheduled to be executed in parallel
    // CCR takes care of the task scheduling for you
}

public void Teardown(bool b)
{
    // clean up when all tasks are done
}
SpaceghostAli
That looks really ugly compared to PEX, or simple ThreadPool.QueueUserWorkItem. Anyway, I didn't know about it; thanks for sharing this! (+1)
ShdNx
A: 

In the long run, I think you'll be happier if you manage your own threads. This will let you control how many are running and make it easy to report status.

  1. Build a worker class that does the processing and give it a callback routine to return results and status.
  2. For each file, create a worker instance and a thread to run it. Put the thread in a Queue.
  3. Peel threads off of the queue up to the maximum you want to run simultaneously. As each thread completes go get another one. Adjust the maximum and measure throughput. I prefer to use a Dictionary to hold running threads, keyed by their ManagedThreadId.
  4. To stop early, just clear the queue.
  5. Use locking around your thread collections to preserve your sanity.
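The steps above can be sketched roughly as below. This is a simplification, not the exact design described: it queues file names instead of thread objects and starts a fixed number of workers that pull from the queue. The class name, `ProcessData`, and `Cancel` are all hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;

public class FileWorkQueue
{
    private readonly Queue<string> pending = new Queue<string>();
    private readonly Dictionary<string, string> results = new Dictionary<string, string>();
    private readonly object sync = new object(); // guards both collections

    public FileWorkQueue(IEnumerable<string> files)
    {
        foreach (string f in files) pending.Enqueue(f);
    }

    // Hypothetical placeholder for the real per-file processing.
    private static string ProcessData(string raw)
    {
        return raw.Trim();
    }

    // Runs at most maxThreads workers and returns once the queue is drained.
    public Dictionary<string, string> Run(int maxThreads)
    {
        List<Thread> workers = new List<Thread>();
        for (int i = 0; i < maxThreads; i++)
        {
            Thread worker = new Thread(Work);
            workers.Add(worker);
            worker.Start();
        }
        foreach (Thread w in workers) w.Join();
        lock (sync) { return new Dictionary<string, string>(results); }
    }

    // To stop early, clear the queue; files already in progress still finish.
    public void Cancel()
    {
        lock (sync) { pending.Clear(); }
    }

    private void Work()
    {
        while (true)
        {
            string file;
            lock (sync)
            {
                if (pending.Count == 0) return; // nothing left to do
                file = pending.Dequeue();
            }
            // File I/O and processing happen outside the lock.
            string processed = ProcessData(File.ReadAllText(file));
            lock (sync) { results[file] = processed; }
        }
    }
}
```

Adjusting the `maxThreads` argument and timing `Run` is one way to do the throughput measurement suggested in step 3.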
ebpower
+8  A: 

I would suggest using ThreadPool.QueueUserWorkItem(...). With it, the threads are managed by the system and the .NET Framework. The chances of messing up with your own thread pool are much higher, so I would recommend using the ThreadPool provided by .NET. It's very easy to use:

// Queue the method on the pool; the second argument is handed to it as "o".
ThreadPool.QueueUserWorkItem(new WaitCallback(YourMethod), ParameterToBeUsedByMethod);

void YourMethod(object o) { /* your code here */ }

For more reading, please follow the link http://msdn.microsoft.com/en-us/library/3dasc8as%28VS.80%29.aspx

Hope this helps

sumit_programmer