I would like to describe some specifics of my program and get feedback on which multithreading model would be most applicable. I've spent a lot of time reading about ThreadPool, Threads, Producer/Consumer, etc., and have yet to come to any solid conclusions.

I have a list of files (all the same format) but with different contents. I have to perform work on each file. The work consists of reading the file, some processing that takes about 1-2 minutes of straight number crunching, and then writing large output files at the end.

I would like the UI to remain responsive after I initiate the work on the specified files.

Some questions:

  1. What model/mechanisms should I use? Producer/Consumer, WorkPool, etc.
  2. Should I use a BackgroundWorker in the UI for responsiveness or can I launch the threading from within the Form as long as I leave the UI thread alone to continue responding to user input?
  3. How can I take the result or status of the work on each individual file and report it to the UI in a thread-safe way, to give the user feedback as the work progresses? (There can be close to 1000 files to process.)

Update:

Great feedback so far, very helpful. I'm adding some more details that are asked below:

  • Output is to multiple independent files: one set of output files per "work item", which is then read and processed by another process before the "work item" is complete.

  • The work items/threads do not share any resources.

  • The work items are processed in part using an unmanaged static library that makes use of boost libraries.

+1  A: 

Typically you should use BackgroundWorker for background processing for a UI, as this is what the class is specifically designed to do. And typically a thread pool is used for server applications.

You could try using multiple BackgroundWorkers to accomplish what you need to do. Just add all of the files to a Queue, and then spawn a BackgroundWorker to read from the Queue and process the next file. You could probably spawn up to n workers to process multiple files at a time; you would just need some means of tracking which worker is handling each file so that you report meaningful progress to the UI.

To identify what work each worker is doing, you can pass an argument to RunWorkerAsync that identifies the work (or the worker). That argument can then be accessed in DoWork via the DoWorkEventArgs.Argument property. To know which worker is reporting progress, you can add a separate event handler for each one and/or pass an object to ReportProgress that identifies the worker.
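As a rough sketch of the above (not a definitive implementation): the helper below builds a BackgroundWorker whose DoWork handler recovers its own worker from the `sender` argument and its file list from `DoWorkEventArgs.Argument`. The names `FileBatchWorker`, `Create` and `processFile` are hypothetical, and it assumes the argument passed to RunWorkerAsync is a `List<string>` of file paths.

```csharp
using System;
using System.Collections.Generic;
using System.ComponentModel;

public static class FileBatchWorker
{
    // Hypothetical helper: processFile stands in for the real per-file work.
    public static BackgroundWorker Create(Action<string> processFile)
    {
        var worker = new BackgroundWorker
        {
            WorkerReportsProgress = true,
            WorkerSupportsCancellation = true
        };

        worker.DoWork += (sender, e) =>
        {
            // sender is the BackgroundWorker that raised DoWork
            var self = (BackgroundWorker)sender;
            // e.Argument is whatever was passed to RunWorkerAsync
            var files = (List<string>)e.Argument;

            int done = 0;
            foreach (string file in files)
            {
                if (self.CancellationPending)
                {
                    e.Cancel = true;
                    return;
                }

                processFile(file);
                done++;

                // the second argument arrives as ProgressChangedEventArgs.UserState
                self.ReportProgress(100 * done / files.Count, file);
            }

            e.Result = done; // available as RunWorkerCompletedEventArgs.Result
        };

        return worker;
    }
}
```

Each worker would be started with its own list, for example `worker.RunWorkerAsync(filesForThisWorker)`, and a ProgressChanged handler can read the file name from `e.UserState` to tell the user which file just finished.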

Does that help?

Justin Ethier
Yes, it does. How do I keep track of the multiple BW's? I just tried moving from a single member to a List<BackgroundWorker>. One problem is how do I access the specific worker via the DoWork method to call ReportProgress or check for Cancellation. Within DoWork I don't know which worker I am executing under. Does that make sense?
user144182
Yes, I just updated the answer to include more information.
Justin Ethier
@Justin Ethier - the Object^ sender argument of the DoWork handler is, in fact, the BackgroundWorker that was just started. I am using the DoWorkEventArgs to specify data specific to what list of files each BW thread is processing.
user144182
+1  A: 

Update based on comments:
I don't agree with the statement that a ThreadPool will not be able to handle the workload you're encountering... let's look at your problem and get more specific:
1. You have almost 1000 files.
2. Each file might take up to 2 minutes of CPU-intensive work to process.
3. You want to have parallel processing to increase throughput.
4. You want to signal when each file is complete and update the UI.

Realistically you don't want to run 1000 threads, because you're limited by the number of cores you have... and since it's CPU intensive work you are likely to max out the CPU load with very few threads (in my programs it's usually optimal to have 2-4 threads per core).

So you shouldn't load 1000 work items in the ThreadPool and expect to see an increase of throughput. You'll have to create an environment where you're always running with an optimal number of threads and this requires some engineering.

I'll have to contradict my original statement a little bit and actually recommend a Producer/Consumer design. Check out this question for more details on the pattern.

Here is what the Producer might look like:

class Producer
{
    private readonly CountDownLatch _latch;
    private readonly BlockingQueue<WorkItem> _workQueue;

    public Producer(CountDownLatch latch, BlockingQueue<WorkItem> workQueue)
    {
        _latch = latch;
        _workQueue = workQueue;
    }

    public void Run()
    {
        while(hasMoreFiles)
        {
            // load the file and enqueue it
            _workQueue.Enqueue(nextFileJob);
        }

        // tell the parent thread that the producer is done
        _latch.Signal();
    }
}

Here is your consumer:

class Consumer
{
    private readonly CountDownLatch _latch;
    private readonly BlockingQueue<WorkItem> _workQueue;
    private readonly ReportStatusToUI _reportDelegate;

    public Consumer(CountDownLatch latch, BlockingQueue<WorkItem> workQueue, ReportStatusToUI reportDelegate)
    {
        _latch = latch;
        _workQueue = workQueue;
        _reportDelegate = reportDelegate;
    }

    public void Run()
    {
        while(!terminationCondition)
        {
            // blocks until there is something in the queue
            WorkItem workItem = _workQueue.Dequeue();

            // Work that takes 1-2 minutes
            DoWork(workItem);

            // a delegate that is executed on the UI (use BeginInvoke on the UI)
            _reportDelegate(someStatusIndicator);
        }

        // tell the parent thread that this consumer is done
        _latch.Signal();
    }
}

A CountDownLatch:

public class CountDownLatch
{
    private int m_remain;
    private EventWaitHandle m_event;

    public CountDownLatch(int count)
    {
        Reset(count);
    }

    public void Reset(int count)
    {
        if (count < 0)
            throw new ArgumentOutOfRangeException();
        m_remain = count;
        m_event = new ManualResetEvent(false);
        if (m_remain == 0)
        {
            m_event.Set();
        }
    }

    public void Signal()
    {
        // The last thread to signal also sets the event.
        if (Interlocked.Decrement(ref m_remain) == 0)
            m_event.Set();
    }

    public void Wait()
    {
        m_event.WaitOne();
    }
}

Jicksa's BlockingQueue:

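class BlockingQueue<T> {
    private readonly Queue<T> q = new Queue<T>();

    public void Enqueue(T element) {
        // Queue<T> is not thread-safe, so only touch it while holding the lock
        lock (q) {
            q.Enqueue(element);
            Monitor.Pulse(q);
        }
    }

    public T Dequeue() {
        lock(q) {
            // re-check the condition after every wake-up
            while (q.Count == 0) {
                Monitor.Wait(q);
            }
            return q.Dequeue();
        }
    }
}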

So what does that leave? Well now all you have to do is start all your threads... you can start them in a ThreadPool, as BackgroundWorker, or each one as a new Thread and it doesn't make any difference.

You only need to create one Producer and the optimal number of Consumers that will be feasible given the number of cores you have (about 2-4 Consumers per core).

The parent thread (NOT your UI thread) should block until all consumer threads are done:

void StartThreads()
{
    // one latch slot per thread: numConsumers consumers plus the single producer
    CountDownLatch latch = new CountDownLatch(numConsumers + 1);
    BlockingQueue<WorkItem> workQueue = new BlockingQueue<WorkItem>();

    Producer producer = new Producer(latch, workQueue);
    if(youLikeThreads)
    {
        Thread p = new Thread(producer.Run);
        p.IsBackground = true;
        p.Start();
    }
    else if(youLikeThreadPools)
    {
        // QueueUserWorkItem expects a WaitCallback, which takes a state argument
        ThreadPool.QueueUserWorkItem(state => producer.Run());
    }

    for (int i = 0; i < numConsumers; ++i)
    {
        Consumer consumer = new Consumer(latch, workQueue, theDelegate);

        if(youLikeThreads)
        {
            Thread c = new Thread(consumer.Run);

            c.IsBackground = true;

            c.Start();
        }
        else if(youLikeThreadPools)
        {
            ThreadPool.QueueUserWorkItem(state => consumer.Run());
        }
    }

    // wait for all the threads to signal
    latch.Wait();

    SayHelloToTheUI();
}

Please note that the above code is illustrative only. You still need to send a termination signal to the Consumer and the Producer, and you need to do it in a thread-safe manner.
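One possible way to deliver the termination signal, sketched under assumptions rather than taken from the code above: let the producer close the queue after the last file, so that blocked consumers wake up and see that no more work is coming. `ClosableQueue<T>`, `Close` and `TryDequeue` are hypothetical names for a variant of the BlockingQueue shown earlier.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical variant of BlockingQueue<T> that can be closed by the producer.
public class ClosableQueue<T>
{
    private readonly Queue<T> _items = new Queue<T>();
    private bool _closed;

    public void Enqueue(T item)
    {
        lock (_items)
        {
            if (_closed)
                throw new InvalidOperationException("Queue is closed.");
            _items.Enqueue(item);
            Monitor.Pulse(_items); // wake one waiting consumer
        }
    }

    // Called by the producer after the last item has been enqueued.
    public void Close()
    {
        lock (_items)
        {
            _closed = true;
            Monitor.PulseAll(_items); // wake every waiting consumer
        }
    }

    // Blocks while the queue is empty and still open.
    // Returns false once the queue is closed and fully drained.
    public bool TryDequeue(out T item)
    {
        lock (_items)
        {
            while (_items.Count == 0 && !_closed)
                Monitor.Wait(_items);

            if (_items.Count == 0)
            {
                item = default(T);
                return false;
            }

            item = _items.Dequeue();
            return true;
        }
    }
}
```

With a queue like this, the Consumer loop could become `while (queue.TryDequeue(out workItem)) { DoWork(workItem); }` wrapped in a try/finally that calls `latch.Signal()`, and the Producer would call `Close()` right after enqueuing the last file.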

Lirik
Given the amount of processing I'll be doing in my threads the general consensus I read was to stay away from the ThreadPool. Do you agree?
user144182
I'll update my answer based on your comment.
Lirik
So given the amount of processing, you're free to use a ThreadPool, a BackgroundWorker or a new Thread, and you are not going to see a difference in throughput or performance (but it will be maximized) given the architecture I've illustrated.
Lirik
Thank you so much. Can you give me a hint on how to implement the termination signal?
user144182
The Producer is easy to terminate: you just exit the while loop when you've loaded the last file. The Consumer is tricky (because it will block if there is no data), so you have to interrupt each thread when you see that the work queue is empty. Interrupting can be done by calling the Interrupt method on each Consumer thread. You should surround the Consumer's while loop with a try/finally block and in the finally block you should call latch.Signal(). I'm not sure how to interrupt an item on the ThreadPool, but I posted this question in hopes that somebody else will: http://tiny.cc/aejzA
Lirik
@Lirik your answer has provided the most help so far, if only in leading me to learn more about thread safety and synchronization. A followup and corollary question I am asking that relates to this question is here: http://stackoverflow.com/questions/2448928/how-to-make-stack-pop-threadsafe
user144182
@user260197 I'm glad my answer was helpful... please don't forget to accept the best answer by clicking on the check mark next to the best answer. The BlockingQueue I provided is not optimal, but on your new question I've posted one which is better than this one.
Lirik
A: 

I do agree with Justin Ethier. The BackgroundWorker is an easy-to-play-with threading tool.

I understand you're wondering which threading model to use. The answer depends on the objects you're working with. Let me explain.

Even if you would like to use, say, a carefree threading model where the developer doesn't need to worry about thread safety, if your objects or libraries are not thread-safe you will need to lock() them so that only one thread works with them at a time. For instance, the .NET 3.5 collections are not thread-safe.

Here's a related question that should help; it includes an explanation from Eric Lippert himself! I also recommend that you read his blog on MSDN.

Hope this helps!

Will Marcouiller
A: 

BackgroundWorker sounds reasonable.
The main question is how many should run in parallel, as your task seems to be more I/O-bound than CPU-bound; you might also gain by reading from and writing to different I/O devices.

weismat
+1  A: 

I would not use the background worker -- that ties your processing to the Winform UI layer. If you want to create a non-visual class that handles the threading and processing, you are best off using the Threadpool.

I would use the Threadpool vs. "straight" threads, as .Net will do some load balancing with the pool, and it recycles the threads so that you don't have to incur the cost of creating threads.

If you are using .Net 4, you might have a look at the new Task Parallel Library; I think it wraps a lot of the producer/consumer stuff.

You probably do want to use some sort of "throttle" to control how fast you are processing files (you probably don't want all 1000 files loaded into memory at once, etc). You might consider a producer/consumer pattern where you can control how many threads are processing at a time.
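For the .Net 4 route, a minimal sketch of such a throttle, assuming the per-file work can be expressed as an `Action<string>`: `Parallel.ForEach` with `ParallelOptions.MaxDegreeOfParallelism` caps how many files are processed at once. `ThrottledBatch` and `Run` are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledBatch
{
    // Runs processFile over every file, with at most maxParallel running at once.
    // Returns the number of files that were processed.
    public static int Run(IEnumerable<string> files, Action<string> processFile, int maxParallel)
    {
        int completed = 0;
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };

        Parallel.ForEach(files, options, file =>
        {
            processFile(file);
            // several iterations may finish at the same time
            Interlocked.Increment(ref completed);
        });

        return completed;
    }
}
```

Note that `Parallel.ForEach` blocks the calling thread until every file is done, so it should be called from a background thread rather than from the UI thread. Passing `Environment.ProcessorCount` as `maxParallel` gives roughly one file per processor at a time.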

For thread-safe updates back to the UI, use the InvokeRequired and Invoke/BeginInvoke members on the Winforms controls.

Edit -- code example: My example is simpler than Lirik's, but it doesn't do as much either. If you need a full producer/consumer, go with what Lirik wrote. From your question, it seems like you want to build a list of files, hand them off to some other component, and let those files be processed in the background. If that's all you want to do, you probably don't need a full producer/consumer.

I'm assuming that this is some sort of batch operation, and that once the user starts it, they will not be adding more files until the batch finishes. If that's not true, you might be better off with a producer/consumer.

This example can be used with a Winform, but you don't have to. You could use this component in a service, a console app, etc:


    public class FileProcessor
    {
        private int MaxThreads = System.Environment.ProcessorCount;
        private volatile int ActiveWorkers;

        // you could define your own handler here to pass completion stats
        public event System.EventHandler FileProcessed;

        public event System.EventHandler Finished;

        private readonly object LockObj = new object();
        private System.Collections.Generic.Queue<string> Files;

        public void ProcessFiles(System.Collections.Generic.Queue<string> files)
        {
            this.Files = files;
            for (int i = 0; i < this.MaxThreads; i++)
            {
                // count each worker before it is queued, so that Finished
                // cannot fire while other workers have yet to start
                this.IncrementActiveWorkers();
                System.Threading.ThreadPool.QueueUserWorkItem(this.ProcessFile);
            }
        }

        private void ProcessFile(object state)
        {
            string file = this.DequeueNextFile();
            while (file != null)
            {
                this.DoYourWork(file);
                this.OnFileProcessed(file);
                file = this.DequeueNextFile();
            } 
            // no more files left in the queue
            int workers = this.DecrementActiveWorkers();
            if (workers == 0)
                this.OnFinished();
        }

        // please give me a name!
        private void DoYourWork(string fileName) { }

        private void IncrementActiveWorkers()
        {
            lock (this.LockObj)
            {
                this.ActiveWorkers++;
            }
        }

        private int DecrementActiveWorkers()
        {
            lock (this.LockObj)
            {
                this.ActiveWorkers--;
                return this.ActiveWorkers;
            }
        }

        private string DequeueNextFile()
        {
            lock (this.LockObj)
            {
                // check for items available in queue
                if (this.Files.Count > 0)
                    return this.Files.Dequeue();
                else
                    return null;
            }

        }

        private void OnFileProcessed(string fileName)
        {
            System.EventHandler fileProcessed = this.FileProcessed;
            if (fileProcessed != null)
                fileProcessed(this, System.EventArgs.Empty);
        }

        private void OnFinished()
        {
            System.EventHandler finished = this.Finished;
            if (finished != null)
                finished(this, System.EventArgs.Empty);
        }
    }

Since you said "specified files", I'm assuming that your Winform app has some sort of grid or listbox or other control that a user interacts with to select the files that are to be processed.

Here's an example of how to use it:


public class MyForm...
{
  public void Go()
  {
     Queue<string> files = new Queue<string>();
     // enqueue the name/path of all selected files into the queue...
     // now process them
     FileProcessor fp = new FileProcessor();

     // example of using an event
     fp.Finished += this.FileProcessor_Finished;

     fp.ProcessFiles(files);
  }

  private void FileProcessor_Finished(object sender, System.EventArgs e)
  {
     // this event will have been called by a non-ui thread.  Marshal it back to the UI
     if(this.InvokeRequired)
       this.Invoke(new System.EventHandler(this.FileProcessor_Finished), new object[] {sender, e});
     else
     {
        // handle the event -- this will be run on the UI thread.
     }
  }
}

JMarsch
I do need to throttle for performance - ideally I want one file being processed at a time per processor. How would I use producer/consumer using the thread pool? One problem I had with thread pool was I queued all my work items and then used WaitAll on their ManualResetEvents, but WaitAll is limited to 64 handles.
user144182
How would I not be tied to the Winform layer somehow if I'm initiating the threading as a result of the user interface? I thought the whole point of the BackgroundWorker is to decouple execution of some task from the UI thread. Whether that thread is encapsulated by the BW or a ThreadPool both are decoupled. Or am I seeing this incorrectly?
user144182
The background worker decouples your work thread from the UI thread, but it doesn't decouple your CODE from the UI thread. What if you decided to process those files in a Windows Service instead of a Winform app? What if you wanted to do it from a Console app, or a WPF app? The BackgroundWorker is Winform-specific, and it ties you to a Winform solution. That's what I meant about coupling. Edited, with a code example.
JMarsch