I have an application that processes files in a directory and moves them to another directory along with the processed output. Nothing special about that. An interesting requirement was introduced:
Provide fault tolerance and increase processing throughput by allowing multiple remote instances to work on the same file store.
An additional consideration is that we cannot make assumptions about the file system, since we support both Windows shares and NFS.
Of course the problem is: how do I make sure that the different instances do not try to process the same work, potentially corrupting it or reducing throughput? File locking can be problematic, especially across network shares. We could use a more sophisticated method, such as a simple database or a messaging framework (a la JMS or similar), but the entire cluster needs to be fault tolerant. We can't rely on a single database or messaging provider because of the single point of failure it introduces.
We've implemented a solution that uses multicast messages to self-discover processing instances and elect a supervisor who assigns work. There's a timeout in case the supervisor goes down, at which point another election takes place. Our networking library, however, isn't very mature, and our implementation of the messages is clunky.
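For context, here is a stripped-down sketch of the kind of thing we're doing: each instance announces itself over a multicast group, peers that go quiet past a timeout are dropped, and the lowest surviving ID is treated as the supervisor. The group address, port, timeouts, and lowest-ID election rule are illustrative only, not our actual wire format or policy.

```java
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.MulticastSocket;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: multicast self-discovery plus a naive
// "lowest ID wins" supervisor election with a liveness timeout.
public class InstanceDiscovery {
    private static final String GROUP = "230.0.0.1"; // illustrative multicast group
    private static final int PORT = 4446;            // illustrative port
    private static final long TIMEOUT_MS = 10_000;   // peer considered dead after this

    private final String instanceId = UUID.randomUUID().toString();
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

    public void run() throws Exception {
        InetAddress group = InetAddress.getByName(GROUP);
        MulticastSocket socket = new MulticastSocket(PORT);
        socket.joinGroup(group);

        // Sender thread: periodically announce our own ID to the group.
        Thread sender = new Thread(() -> {
            try (MulticastSocket out = new MulticastSocket()) {
                while (true) {
                    byte[] payload = instanceId.getBytes(StandardCharsets.UTF_8);
                    out.send(new DatagramPacket(payload, payload.length, group, PORT));
                    Thread.sleep(2_000);
                }
            } catch (Exception ignored) { }
        });
        sender.setDaemon(true);
        sender.start();

        // Receiver loop: record when each peer (including ourselves) was last heard.
        byte[] buf = new byte[256];
        while (true) {
            DatagramPacket packet = new DatagramPacket(buf, buf.length);
            socket.receive(packet);
            String peerId = new String(packet.getData(), 0, packet.getLength(), StandardCharsets.UTF_8);
            lastSeen.put(peerId, System.currentTimeMillis());

            // Election: drop peers not heard from within the timeout,
            // then treat the lowest surviving ID as the supervisor.
            long cutoff = System.currentTimeMillis() - TIMEOUT_MS;
            lastSeen.values().removeIf(t -> t < cutoff);
            String supervisor = lastSeen.keySet().stream().sorted().findFirst().orElse(instanceId);
            if (supervisor.equals(instanceId)) {
                // We are the supervisor: this is where work assignment would happen.
            }
        }
    }

    public static void main(String[] args) throws Exception {
        new InstanceDiscovery().run();
    }
}
```

The real thing carries work-assignment messages as well, which is where most of the clunkiness lives.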
My instincts, however, tell me that there is a simpler way.
Thoughts?