We aim to implement a distributed system on a cluster that will perform resource-intensive, image-based computation with heavy storage I/O. It has the following characteristics:
- There is one dedicated manager node and up to 100 compute nodes. The cluster must be easily expandable.
- It is built around a job/task concept. A job may have from one to 100,000 tasks.
- A job, initiated by the user on the manager node, results in the creation of tasks on the compute nodes.
- Tasks may create other tasks on the fly.
- Some tasks may run for minutes, while others may take many hours.
- The tasks run according to a dependency hierarchy, which may be updated on the fly.
- The job may be paused and resumed later.
- Each task requires specific resources in terms of CPU cores, memory, and local hard disk space. The manager must take these requirements into account when scheduling tasks.
- Tasks report their progress and results back to the manager.
- The manager can tell whether a task is alive or has hung (a sketch of the task descriptor and heartbeat we have in mind follows this list).
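To make these requirements concrete, here is a minimal sketch (in Python, purely illustrative; names like `TaskSpec` and `Heartbeat` are our own, not any framework's API) of the task descriptor and heartbeat message we have in mind:

```python
from dataclasses import dataclass, field

@dataclass
class ResourceRequirements:
    """Resources a task needs; the scheduler must match these against nodes."""
    cpu_cores: int
    memory_mb: int
    disk_mb: int  # local scratch space -- a unit HPCS cannot express

@dataclass
class TaskSpec:
    task_id: str
    command: str                    # executable and arguments to run
    resources: ResourceRequirements
    depends_on: list[str] = field(default_factory=list)  # may grow at runtime

@dataclass
class Heartbeat:
    """Sent periodically by a running task so the manager can detect hangs.

    The manager would mark a task as hung if no heartbeat arrives
    within some timeout, and could then decide to restart it.
    """
    task_id: str
    progress_percent: float  # 0.0 .. 100.0
```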
We found Windows HPC Server 2008 R2 (HPCS) conceptually very close to what we need. However, there are a few critical downsides:
- Task creation becomes exponentially slower as the number of tasks grows; submitting more than a few thousand tasks takes an unbearable amount of time.
- A task cannot report its progress back to the manager; only the job can.
- There is no way to communicate with a task at runtime, which makes it impossible to tell whether a task is still running or needs to be restarted.
- HPCS only recognizes nodes, CPU cores, and memory as resource units. We cannot define resource units of our own (e.g., free disk space or custom hardware devices). The sketch below shows the kind of matching we need.
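For illustration, the scheduling decision we need looks roughly like this (a hypothetical sketch of our own; HPCS offers nothing equivalent):

```python
def can_run(node_free: dict[str, int], required: dict[str, int]) -> bool:
    """True if a node's free resources cover a task's requirements.

    Both sides are open-ended dictionaries, so custom resource units
    (disk space, hardware dongles, ...) work the same way as cores
    and memory do.
    """
    return all(node_free.get(unit, 0) >= amount
               for unit, amount in required.items())

# Example: a task needing 4 cores, 8 GB RAM, and 50 GB of scratch disk.
task_needs = {"cpu_cores": 4, "memory_mb": 8192, "disk_mb": 51200}
node_state = {"cpu_cores": 16, "memory_mb": 65536, "disk_mb": 40960}
print(can_run(node_state, task_needs))  # False -- not enough free disk
```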
Here's my question: does anybody know of, or have experience with, a distributed computing framework that could help us? We are using Windows.