Restore of data replicated with erasure-coding

I need your help on the design of the restore-procedure in a backup system I'm building. The prerequisites are the following:

A restore can be made of one or several files.
A file consists of 1 or more data-blocks (on average 16mb in size, minimum 4mb maximum 64mb).
A data-block is replicated using erasure coding into replication-blocks (256KB big).
The replication blocks are spread out across lots of unreliable nodes (i.e. you can never expect a response from any of the nodes).
Each node will store at most one replication-block for a data-block and all files will have basically the same set of nodes who store their data-blocks.
It's written in Java using Apache Mina as network library. If you are not familiar with Apache Mina then pseudo-code will do just fine! The only thing you need to know is that Apache Mina uses asynchronous message passing.

My question is: how should I design the restore of the files? I'm looking for a general design suggestions like, what should be done concurrently (files, data-blocks or replication-blocks), what patterns I should use (consumer-producer, just a big thread-pool with tasks, something else?). Any tips and suggestions are appreciated! Potential problems which the design needs to avoid:

Too many requests for replication-blocks are sent simultaneously so that either the remote nodes become overloaded or you run out of memory because the queue becomes too big.
It sends too few requests simultaneously and is therefore not using all the available bandwidth.
You run out of resources because too many threads are running simultaneously.

Other requirements are:

If not enough nodes provide a response in order for the data-block to become restored then resends of the requests must be made after a certain timeout.
You must be able to track the progress of a restore so that the user knows what's happening.
Since all nodes store one replication-block for each data-block but all files (and thus also data-blocks) share pretty much the same nodes, it is desirably to have good concurrency for the data-blocks as well. Otherwise you spend too much time waiting between requests for nodes with high bandwidth (since you can only request one replication-block per data-block and node).

I have a couple of ideas on how to implement it but I rather see your guys ideas' before affecting them with my bias.

ansaurus

tags:

views:

answers:

Restore of data replicated with erasure-coding

related questions