Theory:
This is leader election, which is a form of the Consensus Problem, also sometimes called The Two Generals Problem. Under some sets of assumptions (fully async and messages can be lost) it's been proven impossible, and the proof is particularly elegant.
The intuition of this problem is: imagine some algorithm exists that allows consensus to be reached in some fixed number of messages. Since failures are tolerated, we can drop one message from the protocol, and it should still work. We can repeat this process until there are no messages at all, clearly an impossibility.
In practice we overcome this using failure detectors to simulate a synchronous system.
The most widely known algorithm that solves consensus is Paxos, which can tolerate failure of up to half of the participating nodes. Paxos has the reputation of being very difficult to implement as even slight misunderstandings of the details of the protocol destroy it's correctness.
Practical solutions:
While the problem in general is quite difficult, getting working systems up is far easier. There are off the shelf implementations of Paxos or equivalent algorithms available. Apache Zookeeper is the best I'm aware of. For your specific problem, I'm pretty sure it'll be your quickest route. Other Paxos implementations are around, and it also might be possible to build something on network redundancy virtual ip tools like Wackamole. I believe the high end versions of most commercial databases offer quorum features as an (expensive) option.
Also, for many applications it's acceptable to weaken correctness slightly or otherwise adjust the problem to allow much simpler solutions.
For example, if a single point of failure is tolerable because recovery is likely to be quick, then the problem is trivial: just have one special node do the work.
Another approach might be to build the system around idempotent actions, so duplicate processing becomes tolerable.
Lastly you might partition the workload into a pool of non-redundant systems: here failures will delay processing until recovery but only for items at that node, not for the entire workload.
These sorts of compromises are so much simpler that they're often a better choice. One has to weigh the utility of a full solution against the complexity of implementing it and see if there's really value. This is why so many practical systems just use 2 Phase or 3 Phase Commit, even though they block in some scenarios: the decreased availability is tolerable compared to the complexity of a full quorum system.