I am about to start writing a system that needs to rebalance its load distribution among the remaining nodes when one or more of the nodes fail. Does anyone have good references on what works and what to avoid?
In particular, I'm curious how one should start building such a system so that it can be unit-tested.
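
To make the testing part of the question concrete, here is a minimal sketch of one possible starting point. It assumes the rebalancing decision can be isolated as a pure function of the current assignments and the set of live nodes, with failure detection and networking kept elsewhere. The function name `rebalance`, the task and node labels, and the "least loaded node wins" policy are all hypothetical and only there for illustration.

```python
def rebalance(assignments, live_nodes):
    """Return a new {task: node} mapping with tasks moved off failed nodes.

    Hypothetical policy: tasks already on a live node stay where they are;
    each orphaned task goes to the live node that currently holds the
    fewest tasks (ties broken by node name so the result is deterministic).
    """
    if not live_nodes:
        raise ValueError("no live nodes left to take the load")

    live = sorted(live_nodes)
    load = {node: 0 for node in live}
    result = {}
    orphans = []

    # Keep tasks whose node is still alive; collect the rest.
    for task, node in sorted(assignments.items()):
        if node in load:
            result[task] = node
            load[node] += 1
        else:
            orphans.append(task)

    # Place each orphan on the least-loaded live node.
    for task in orphans:
        target = min(live, key=lambda n: (load[n], n))
        result[task] = target
        load[target] += 1

    return result


# Example: node "c" has failed, so its two tasks must move.
before = {"t1": "a", "t2": "a", "t3": "b", "t4": "c", "t5": "c"}
after = rebalance(before, {"a", "b"})
```

Because a function like this does no I/O and has no clock or network dependency, a unit test can simply pass in a mapping and a set of surviving nodes and assert on the returned mapping.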