We have an existing system that processes a lot of files on an ongoing basis. Roughly speaking, about 3 million files a day that can range in size from a few kilobytes to in excess of 50 MB. These files go through a few different stages of processing from the time they are received to when they are finished being consumed, depending on the path they take. Due to the content and format of these files, they can NOT be broken up into smaller chunks.

Currently, the workflow these files move through is rigid and dictated by the code, with fixed inputs and outputs (in many cases, one subscriber becomes the publisher for a new set of files). This lack of flexibility is starting to cause us issues, however, so I'm looking at some kind of pub/sub solution to handle new requirements.

Most traditional pub/sub solutions have the data within the actual payload, but the large potential file sizes exceed the limits of many messaging platforms. Furthermore, we have multiple platforms in play: files progress through both Linux and Windows tiers depending on their path.

Does anyone have any design and/or implementation recommendations with the following goals in mind?
1. Multiplatform for both pub and sub (Linux and Windows)
2. Persistent storage/store-and-forward support
3. Can handle large event payloads and appropriately cleans up once all subscribers have been serviced
4. Routing/workflow is done via configuration
5. Subscribers can subscribe to a filtered set of published events based on changing criteria (e.g. only give me files of a specific type)

I've done a bunch of digging into a number of service bus and MQ implementations, but haven't quite been able to firm up enough of a design approach to properly evaluate what tools make the most sense. Thanks for any input.

A: 

A1. I developed a similar system at my previous job. We didn't pass the multi-MB payload inside the message; instead, we stored it on a file server and only passed the UNC file name (the messaging was Java RMI, but pretty much anything will work).
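
To make that concrete, here's a minimal sketch of the pass-a-reference idea, using RabbitMQ and its Python client (pika) purely as an example broker (we used Java RMI; pretty much anything with durable queues would do). The queue name and the UNC path are made up for the illustration.

    # Publisher side: the file stays on the share, only its UNC path travels
    # in the message. Assumes a RabbitMQ broker on localhost and the pika client.
    import json
    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="stage1.work", durable=True)  # queue survives broker restarts

    event = {"path": r"\\fileserver\inbox\batch42\report.bin", "type": "report"}
    channel.basic_publish(
        exchange="",                                        # default exchange routes by queue name
        routing_key="stage1.work",
        body=json.dumps(event),
        properties=pika.BasicProperties(delivery_mode=2),   # persist the message itself
    )
    connection.close()

Each subscriber then opens the file from the share itself, which keeps the multi-MB payloads out of the broker entirely.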

A2. I recently started using Windows Communication Foundation. Fortunately for me, I only have to support Windows, and I don't need such big messages. However, the documentation says the protocol is platform-independent, and there's an option to pass huge chunks of data using its streaming message transfer feature.

In both cases, I think you'll have to fulfill your #4 and #5 requirements in your own code.
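
As a rough illustration of what "in your own code" could look like for #5: the subscriber loads its filter criteria from a config file and ignores events that don't match, so changing the criteria is a config edit rather than a code change. The config file name, the event fields, and the example paths below are all invented for the sketch.

    # Subscriber-side filtering driven by configuration (a sketch, not a framework).
    import fnmatch
    import json

    DEFAULT_CRITERIA = {"types": ["report"], "name_glob": "*"}

    def load_criteria(path="subscriber.json"):
        # Expected shape, e.g.: {"types": ["report", "invoice"], "name_glob": "*.xml"}
        try:
            with open(path) as f:
                return json.load(f)
        except FileNotFoundError:
            return DEFAULT_CRITERIA

    def matches(event, criteria):
        return (
            event.get("type") in criteria.get("types", [])
            and fnmatch.fnmatch(event.get("path", ""), criteria.get("name_glob", "*"))
        )

    def on_event(event, criteria):
        # In the real system this would be the message callback; here we just print.
        if matches(event, criteria):
            print("processing", event["path"])  # fetch the file from the share here

    if __name__ == "__main__":
        criteria = load_criteria()
        on_event({"type": "report", "path": r"\\fileserver\inbox\daily.xml"}, criteria)
        on_event({"type": "image", "path": r"\\fileserver\inbox\scan.tif"}, criteria)

The same idea can extend to #4: the stage-to-stage routing table (which queue feeds which) can live in a config file that the publishers read, instead of being hard-coded.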

Soonts
Interesting. I thought of passing the filepath, but I'm not sure how to reconcile that with the unknown number of subscribers problem -- how do I know when the last subscriber has gotten the file (so that I can remove it)? Thanks for the WCF pointer on streaming messages. I'm hoping requirements 4 and 5 could potentially be met by some existing implementation, but maybe that's being overly wishful.
Joe
We were lucky: we only had about a dozen processing stages, with very few branches. And we had a definite moment in time when the processing of a bunch of items was over.
Soonts
I think your #4 and #5 are too project-specific to be part of a framework. I've seen several products that did routing/workflow via configuration, and they all used different approaches: integrated MS Visio in BizTalk, huge XML files in some Java framework, a sophisticated custom UI in Virtools, etc. IMO that's because the logic behind routing and subscription is very tightly coupled with the nature of the things you're processing.
Soonts