I have an application right now that uses a pipeline design. In the first stage it reads some data and files into a Stream. There are some intermediate stages that do stuff to the stream of data, and then there is a final stage that writes the stream out to somewhere. This all happens serially: one stage completes and then hands off to the next stage.

This has all been working just great, but now the amount of data is starting to get quite a bit larger (hundreds of GB potentially). So I'm thinking I will need to do something to alleviate this. My initial thought is what I'm looking for feedback on (being an independent developer, I just don't have anywhere to bounce the idea off of).

I'm thinking of creating a parallel pipeline. The object that starts off the pipeline would create all of the stages and kick each one off in its own thread. When the first stage grows its stream to a certain size, it will pass that stream off to the next stage for processing and start up a new stream of its own to continue filling. The idea here is that the final stage will be closing out streams as the first stage is building new ones, so my memory usage would be kept lower.

So, questions:

1. Any high-level thoughts on directions for this design?
2. Is there a simpler approach you can think of that might apply here?
3. Is there anything existing out there that does something like this that I could reuse (not a product I have to buy)?

Thanks,

MikeD

A: 

For the design you've suggested, you'd want to read up on producer/consumer problems if you haven't already. You'll need a good understanding of how to use semaphores in that situation.
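As a concrete illustration, here is a minimal bounded-buffer sketch in C# using two semaphores. The BoundedBuffer name and the idea of handing byte[] chunks between stages are my own assumptions, not anything from your existing code:

```csharp
using System.Collections.Generic;
using System.Threading;

// A bounded buffer between two pipeline stages. The capacity cap is what
// keeps memory usage flat: the producer blocks when the consumer lags.
class BoundedBuffer<T>
{
    private readonly Queue<T> queue = new Queue<T>();
    private readonly Semaphore slotsFree;   // slots the producer may still fill
    private readonly Semaphore itemsReady;  // chunks waiting for the consumer

    public BoundedBuffer(int capacity)
    {
        slotsFree = new Semaphore(capacity, capacity);
        itemsReady = new Semaphore(0, capacity);
    }

    // Called by the upstream stage; blocks if the buffer is full.
    public void Add(T item)
    {
        slotsFree.WaitOne();
        lock (queue) queue.Enqueue(item);
        itemsReady.Release();
    }

    // Called by the downstream stage; blocks until a chunk is available.
    public T Take()
    {
        itemsReady.WaitOne();
        T item;
        lock (queue) item = queue.Dequeue();
        slotsFree.Release();
        return item;
    }
}
```

Each stage thread then just loops on Take from its input buffer and Add to its output buffer; shutdown signalling (e.g. a sentinel chunk) is left out for brevity.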

Another approach you could try is to create multiple identical pipelines, each in a separate thread. This would probably be easier to code because it has a lot less inter-thread communication. However, depending on your data you may not be able to split it into chunks this way.
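For what it's worth, a rough sketch of that approach, assuming the input can be split per file; RunSerialPipeline here is a hypothetical stand-in for your existing serial code:

```csharp
using System.Collections.Generic;
using System.Threading;

class ParallelPipelines
{
    // Hypothetical stand-in for the existing serial pipeline.
    static void RunSerialPipeline(string inputFile)
    {
        // read -> transform -> write, exactly as the code does today
    }

    static void Main(string[] args)
    {
        var threads = new List<Thread>();
        foreach (string file in args)    // one independent partition per file
        {
            string partition = file;     // copy for the closure (pre-C# 5 foreach gotcha)
            var t = new Thread(() => RunSerialPipeline(partition));
            t.Start();
            threads.Add(t);
        }
        foreach (var t in threads) t.Join();
    }
}
```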

Tom Dalling
+1  A: 

The producer/consumer model is a good way to proceed. And Microsoft has their new Parallel Extensions, which should provide most of the groundwork for you. Look into the Task object. There's a preview release available for .NET 3.5 / VS2008.

Your first task should read blocks of data from your stream and then pass them on to other tasks. Then have as many tasks in the middle as logically fit. Smaller tasks are (generally) better. The only thing you need to watch out for is making sure the last task saves the data in the order it was read (because the tasks in the middle may finish in a different order than they started in).
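Here's a rough sketch of that shape, written against the Task and BlockingCollection types as they later shipped in .NET 4 (the 3.5 CTP API differed slightly); the file names, block size, and Transform stage are placeholders of mine:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

class TaskPipeline
{
    static void Main()
    {
        const int BlockSize = 1 << 20; // hypothetical 1 MB blocks
        var toProcess = new BlockingCollection<KeyValuePair<long, byte[]>>(8);
        var toWrite = new BlockingCollection<KeyValuePair<long, byte[]>>(8);

        // Stage 1: read numbered blocks from the source stream.
        var reader = Task.Factory.StartNew(() =>
        {
            using (var input = File.OpenRead("input.dat"))
            {
                var buffer = new byte[BlockSize];
                long seq = 0;
                int n;
                while ((n = input.Read(buffer, 0, buffer.Length)) > 0)
                {
                    var block = new byte[n];
                    Array.Copy(buffer, block, n);
                    toProcess.Add(new KeyValuePair<long, byte[]>(seq++, block));
                }
            }
            toProcess.CompleteAdding();
        });

        // Stage 2: transform blocks. (With several of these workers,
        // CompleteAdding must wait until all of them have finished.)
        var worker = Task.Factory.StartNew(() =>
        {
            foreach (var item in toProcess.GetConsumingEnumerable())
                toWrite.Add(new KeyValuePair<long, byte[]>(item.Key, Transform(item.Value)));
            toWrite.CompleteAdding();
        });

        // Stage 3: write blocks out, reordering by sequence number in case
        // the middle stage completes them out of order.
        var writer = Task.Factory.StartNew(() =>
        {
            var pending = new SortedDictionary<long, byte[]>();
            long next = 0;
            using (var output = File.Create("output.dat"))
            {
                foreach (var item in toWrite.GetConsumingEnumerable())
                {
                    pending[item.Key] = item.Value;
                    byte[] block;
                    while (pending.TryGetValue(next, out block))
                    {
                        output.Write(block, 0, block.Length);
                        pending.Remove(next++);
                    }
                }
            }
        });

        Task.WaitAll(reader, worker, writer);
    }

    static byte[] Transform(byte[] block) { return block; } // placeholder stage
}
```

The bounded capacity (8 blocks here) keeps memory flat regardless of file size, and the writer's small reorder buffer takes care of blocks finishing out of order.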

The Parallel Extensions look very promising. I'm thinking I can create a Task for each stage, start each one up, and use the new concurrent collection classes to pass the Streams between stages. I didn't really want to use VS 2010 yet, and I can't seem to find the preview release anymore. I'll keep looking.
MikeD
The download link for the preview is below, but MS seems to have changed their website and the link doesn't work any more :-(
http://www.microsoft.com/downloads/details.aspx?FamilyId=348F73FD-593D-4B3C-B055-694C50D2B0F3
A: 

In each stage, do you read the entire chunk of data, do the manipulation, and then send the entire chunk to the next stage?

If that is the case, you are using a "push" technique where you push the entire chunk of data to the next stage. Could you handle things in a more stream-like manner using a "pull" technique instead? Each stage is a stream, and as you read data from it, it pulls data from the previous stream by calling Read on it. As each stream is read, it reads from the previous stream in small bits, processes them, and returns the processed data. The destination stream determines how many bytes to read from the previous stream, and you never have to consume large amounts of memory. This is how applications like BizTalk work. There are some blogs about how BizTalk pipeline streams work, and I think they might be exactly what you want.
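To make the pull idea concrete, here is a minimal sketch of one such stage: a Stream decorator whose Read pulls a small buffer from the stream behind it and transforms it on the way through. TransformingStream and the per-byte transform are hypothetical; a real stage would work on whatever unit your data actually has:

```csharp
using System;
using System.IO;

// A pull-based pipeline stage: reading from this stream pulls a small
// buffer from the inner stream, transforms it, and returns it. Chaining
// several of these keeps only one buffer per stage alive at a time.
class TransformingStream : Stream
{
    private readonly Stream inner;
    private readonly Func<byte, byte> transform; // hypothetical per-byte transform

    public TransformingStream(Stream inner, Func<byte, byte> transform)
    {
        this.inner = inner;
        this.transform = transform;
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        int n = inner.Read(buffer, offset, count); // pull from the previous stage
        for (int i = 0; i < n; i++)
            buffer[offset + i] = transform(buffer[offset + i]);
        return n;
    }

    // Forward-only read stream; everything else is unsupported.
    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return false; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }
    public override void Flush() { }
    public override long Seek(long o, SeekOrigin s) { throw new NotSupportedException(); }
    public override void SetLength(long v) { throw new NotSupportedException(); }
    public override void Write(byte[] b, int o, int c) { throw new NotSupportedException(); }
}
```

The destination stage then drives the whole chain, e.g. wrapping File.OpenRead in a couple of TransformingStreams and running a plain read/write loop at the end.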

Here's a multi-part blog entry that you might find interesting:

Part 1
Part 2
Part 3
Part 4
Part 5

Jeremy
Uhhh, why the downvote? What's wrong with this answer?
Jeremy