Firstly, my scenario:
An application with data input, data processing, and result aggregation. The input data can be huge, so it is cut into batches. Batches are independent of one another. Data processing therefore becomes batch processing, which is a relatively time-consuming procedure. Each processed batch yields a batch result, and all batch results must be aggregated into a single final result that is output.
Discussion 1: what is a reasonably good design for turning the above workflow into a multithreaded, parallel one?
The Task Parallel Library (TPL) introduced in .NET 4 would be my choice.
My solution would be to turn the workflow into a producer-consumer pipeline. First, the creation of the batches is done in a task. Batch creation is sequential and single-threaded because it involves reading from a single data file in order. I call this task the "batch producer". The batches produced are put into a BlockingCollection&lt;T&gt; (in System.Collections.Concurrent, shipped alongside the TPL in .NET 4). This is the queue that holds the "goods" for consumers to consume. Compared with batch processing, the batch producer is less computationally intensive, which means that without any control it would produce batches faster than they can be processed. We can throttle the batch producer by giving the BlockingCollection&lt;T&gt; a bounded capacity: when the queue is full of batches, the producer's Add call blocks and the producer sleeps until a consumer makes room. BlockingCollection&lt;T&gt; is the ideal data structure for implementing the producer-consumer pattern.
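A minimal sketch of this first stage (the `Batch` record is a hypothetical stand-in for whatever a real batch would hold, and the loop stands in for reading the data file); the bounded capacity provides the throttling for free:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class ProducerSketch
{
    // Hypothetical batch type; a real one would hold records from the input file.
    record Batch(int Id);

    static void Main()
    {
        // Bounded capacity: Add blocks once 8 batches are waiting,
        // throttling the producer to the consumers' pace.
        var batchQueue = new BlockingCollection<Batch>(boundedCapacity: 8);

        var producer = Task.Run(() =>
        {
            for (int i = 0; i < 20; i++)       // stands in for reading the file
                batchQueue.Add(new Batch(i));  // blocks while the queue is full
            batchQueue.CompleteAdding();       // signals "no more batches"
        });

        // A single consumer, just to drain the queue in this sketch.
        foreach (var b in batchQueue.GetConsumingEnumerable())
            Console.WriteLine($"consumed batch {b.Id}");

        producer.Wait();
    }
}
```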
To fully utilize the multiple cores of modern computers, I would like to use multiple consumers to execute the batch processing in parallel. In the standard textbook version of the pattern there is one producer and one consumer; here there is one producer and multiple consumers. Each consumer grabs a batch from the queue and processes it. I call these consumer tasks "batch workers".
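A sketch of the worker side, with batches reduced to plain `int`s for brevity. The key property is that `GetConsumingEnumerable` is safe to enumerate from several tasks at once: each batch is handed to exactly one of the competing workers.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class WorkersSketch
{
    static void Main()
    {
        var batchQueue = new BlockingCollection<int>(boundedCapacity: 8);
        int workerCount = Environment.ProcessorCount;   // one worker per core

        // GetConsumingEnumerable is thread-safe: each batch goes to one worker.
        var workers = Enumerable.Range(0, workerCount)
            .Select(w => Task.Run(() =>
            {
                foreach (var batch in batchQueue.GetConsumingEnumerable())
                {
                    Thread.Sleep(10);   // stands in for the expensive processing
                    Console.WriteLine($"worker {w} processed batch {batch}");
                }
            }))
            .ToArray();

        for (int i = 0; i < 12; i++) batchQueue.Add(i);
        batchQueue.CompleteAdding();
        Task.WaitAll(workers);
    }
}
```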
For the aggregation, I would use another producer-consumer cycle, but now the producers are the "batch workers". I use a second BlockingCollection&lt;T&gt; to store the batch results they produce, and I create one more task whose job is to grab a batch result from that queue and fold it into the running aggregate it holds. I call this task the "aggregator". Note that the aggregator is also computationally less intensive than the batch workers, so I can throttle how fast the batch workers produce results by setting a capacity on the batch result queue, just like the batch queue mentioned before.
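The two cycles can be wired together roughly as follows (batches are `int`s and a "batch result" is just the square of the batch, purely for illustration). One subtlety worth showing: `CompleteAdding` on the result queue must be called only after *all* workers have finished, e.g. from a continuation on `Task.WhenAll`, or the aggregator will never terminate (or will stop too early).

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

class PipelineSketch
{
    static void Main()
    {
        var batchQueue  = new BlockingCollection<int>(boundedCapacity: 8);
        var resultQueue = new BlockingCollection<long>(boundedCapacity: 8);

        // Cycle 1: batch producer -> batch workers.
        var producer = Task.Run(() =>
        {
            for (int i = 1; i <= 100; i++) batchQueue.Add(i);
            batchQueue.CompleteAdding();
        });

        // Cycle 2: batch workers -> aggregator.
        var workers = Enumerable.Range(0, 4).Select(_ => Task.Run(() =>
        {
            foreach (var batch in batchQueue.GetConsumingEnumerable())
                resultQueue.Add((long)batch * batch);   // contrived "batch result"
        })).ToArray();

        // Only after *all* workers finish is the result stream complete.
        Task.WhenAll(workers).ContinueWith(_ => resultQueue.CompleteAdding());

        long total = 0;   // the aggregator's running aggregate
        foreach (var r in resultQueue.GetConsumingEnumerable())
            total += r;

        Console.WriteLine(total);   // sum of squares 1..100 = 338350
        producer.Wait();
    }
}
```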
To summarize: to parallelize the original workflow, I use one task (thread) for batch generation, multiple tasks (threads) to consume the batches, and one task (thread) to aggregate the results. There are two producer-consumer cycles: in the first, the "batch producer" produces and the "batch workers" consume; in the second, the "batch workers" produce and the "aggregator" consumes. The design has three tuning knobs: A) the capacity of the batch queue, B) the capacity of the batch result queue, and C) the number of "batch workers".
Discussion 2: what are the faults of this design?
Building on this design, I would like to raise two more discussions:
Discussion 3: cancellation support. Cancellation can happen in the middle of batch generation, batch processing, or batch aggregation. When one task detects cancellation, it needs to propagate to all the other tasks so they can cancel themselves in time. Obviously, CancellationTokenSource and CancellationToken are useful here. There are some tricky details to pay attention to: for example, when the queues are full the producers are asleep, and cancellation needs to "wake them up". Any shared code showing how to do this properly would be greatly welcomed.
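One way to do the wake-up, sketched under the assumption that the producer is blocked on a full queue: `BlockingCollection<T>.Add` (and `Take`) have overloads that accept a `CancellationToken`, and a call blocked inside them throws `OperationCanceledException` as soon as the token is cancelled:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

class CancellationSketch
{
    static void Main()
    {
        var cts   = new CancellationTokenSource();
        var queue = new BlockingCollection<int>(boundedCapacity: 2);

        var producer = Task.Run(() =>
        {
            try
            {
                for (int i = 0; ; i++)
                    // This overload wakes the producer from a blocked Add
                    // (full queue) as soon as the token is cancelled.
                    queue.Add(i, cts.Token);
            }
            catch (OperationCanceledException)
            {
                Console.WriteLine("producer cancelled while blocked");
            }
        });

        Thread.Sleep(100);   // let the producer fill the queue and block
        cts.Cancel();        // wakes it up; Add throws OperationCanceledException
        producer.Wait();
    }
}
```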
Discussion 4: exception handling. Again, an exception can happen in any task, and once one task detects an exception, it needs to propagate to all the other tasks so that all of them can cancel their work. Any shared code showing how to do this properly would be of great help.
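A sketch of one common arrangement (the "bad batch" failure is contrived): the faulting task cancels a shared `CancellationTokenSource` so the other tasks unblock and shut down cooperatively, and the original exception is rethrown so it surfaces as an `AggregateException` from `Task.WaitAll`:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

class FaultPropagationSketch
{
    static void Main()
    {
        var cts   = new CancellationTokenSource();
        var queue = new BlockingCollection<int>(boundedCapacity: 2);

        var producer = Task.Run(() =>
        {
            // Cancellation here means another task failed; shut down quietly.
            try { for (int i = 0; ; i++) queue.Add(i, cts.Token); }
            catch (OperationCanceledException) { /* cooperative shutdown */ }
        });

        var worker = Task.Run(() =>
        {
            try
            {
                foreach (var item in queue.GetConsumingEnumerable(cts.Token))
                    if (item == 3) throw new InvalidOperationException("bad batch");
            }
            catch (Exception)
            {
                cts.Cancel();   // one faulting task cancels the whole pipeline
                throw;          // re-throw so the awaiter sees the real error
            }
        });

        try { Task.WaitAll(producer, worker); }
        catch (AggregateException ae)
        {
            foreach (var e in ae.Flatten().InnerExceptions)
                Console.WriteLine(e.Message);   // the original "bad batch" error
        }
    }
}
```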