You have processes that do things - worker processes. There may be many workers of a type - all the same - but there may also be many types of workers.
You build an application by writing types of worker processes and deploying them.
Overseeing the worker processes are supervisor processes - and overseeing the supervisor processes are more supervisor processes (turtles all the way up, except for the top one who's the daddy!)
All supervisors are the same. They only have 2 jobs:
- look out for their workers (if they start crashing, restart them in the way that that type of worker needs to be restarted)
- if too many workers crash too often, report up the line to their supervisors (by crashing and letting their supervisor restart them in the way that they need to be restarted)
That's it. You build small sub-systems out of special types of worker processes that you have designed, and compose them into large, multi-server clusters using the same nearly-bug-free, comprehensively-tested supervisors as everyone else, plus some standard workers that operate on the supervision tree to do things like move sub-systems from one machine to another (these standard workers are codified in behaviours like OTP applications and OTP gen_servers and stuff).
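
To make the two supervisor jobs concrete, here is a toy model in Python. It is only a sketch and not how OTP is implemented: the names Supervisor, Escalate, child_crashed, max_restarts and window_seconds are made up for illustration, and real Erlang supervisors express the same idea through their restart intensity and period settings. The sketch restarts a crashed worker from its own factory (job 1) and, when crashes pile up inside a time window, gives up and raises so that whoever supervises it can react (job 2).

```python
import time


class Escalate(Exception):
    """Raised when a supervisor gives up, i.e. 'crashes' so its own supervisor can react."""


class Supervisor:
    """Toy supervisor (hypothetical names, not the OTP API)."""

    def __init__(self, name, max_restarts=3, window_seconds=5.0, clock=time.monotonic):
        self.name = name
        self.max_restarts = max_restarts      # restarts tolerated inside one window
        self.window_seconds = window_seconds  # length of the sliding window
        self.clock = clock                    # injectable so tests can control time
        self.factories = {}                   # child name -> how to (re)start that worker type
        self.running = {}                     # child name -> current worker instance
        self.restart_times = []               # timestamps of recent crashes

    def add_child(self, child_name, factory):
        """Register a worker type and start one instance of it."""
        self.factories[child_name] = factory
        self.running[child_name] = factory()

    def child_crashed(self, child_name):
        """Job 1: restart the worker. Job 2: escalate if crashes are too frequent."""
        now = self.clock()
        # Forget crashes that fell out of the sliding window.
        self.restart_times = [
            t for t in self.restart_times if now - t < self.window_seconds
        ]
        self.restart_times.append(now)
        if len(self.restart_times) > self.max_restarts:
            # Too many crashes too often: stop everything and report up the line.
            self.running.clear()
            raise Escalate(f"{self.name}: too many crashes, giving up")
        # Restart in the way this type of worker needs: via its own factory.
        self.running[child_name] = self.factories[child_name]()
```

A parent supervisor would treat an Escalate from a child supervisor the same way it treats any other crash, which is what gives you the tree: the same small piece of logic at every level, with only the worker factories differing.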