Likewise, are there design patterns that should be avoided?
High availability is more about hardware availability and redundancy than about coding conventions. There are a couple of patterns that I would use in almost every HA case: I would choose the singleton pattern for my database object and use the factory pattern to create the singleton. The factory can then have the logic to handle availability issues with the database (which is where most availability problems happen). For instance, if the master is down, connect to a second master for both reads and writes until the first master is back. I don't know if these are the most leveraged patterns in general, but they are the most leveraged in my code.
Of course this logic could be handled in a __construct method, but a factory pattern allows you to better control your code and the decision-making logic for handling database connectivity issues. A factory also makes the singleton easier to manage.
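To make that concrete, here's a minimal Java sketch of the factory-plus-singleton idea (the __construct mention above suggests PHP, but the question is about Java). The class name, JDBC URLs and credentials are made up, and a real factory would add retry/backoff logic:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public final class DatabaseFactory {
    private static Connection instance;                         // the singleton handle

    // Hypothetical endpoints: a primary master and a standby master.
    private static final String PRIMARY = "jdbc:mysql://master1.example/app";
    private static final String STANDBY = "jdbc:mysql://master2.example/app";

    private DatabaseFactory() {}

    public static synchronized Connection getConnection() throws SQLException {
        // Recreate the connection if we never had one or the current one looks dead.
        if (instance == null || instance.isClosed() || !instance.isValid(2)) {
            instance = connectWithFailover();
        }
        return instance;
    }

    private static Connection connectWithFailover() throws SQLException {
        try {
            return DriverManager.getConnection(PRIMARY, "app", "secret");
        } catch (SQLException primaryDown) {
            // Master is down: fall back to the second master for both reads and writes.
            return DriverManager.getConnection(STANDBY, "app", "secret");
        }
    }
}
```

Callers only ever ask the factory for a connection; the failover decision stays in one place.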
I would absolutely avoid the decorator pattern and the observer pattern. They both create complexity in your code that makes it difficult to maintain. There are cases where these are the best choice for your needs, but most of the time they are not.
Wrong:
...and there will be a storage server
Good:
...and there will be a farm of (multiple) storage servers with (multiple) load balancers in front of them
Put load balancers in front of everything. You may have 4 backends now, but in the future you may have 400 of them, so it's wise to manage that only on the load balancer, not in all the apps that use the backend.
Use multiple levels of cache.
Look at popular solutions for speeding things up (memcached, for example).
If you are going to replace a system, do it part by part, in multiple small steps. If you do it in one big step (turn off the old one, turn on the new one, and pray it will work), it will most probably fail.
Use DNS names for things, e.g.
storage-lb.servicename
resolves to the addresses of all the storage load balancers (see the sketch after these tips). If you want to add one, just modify the DNS and all the services will start using it automatically.
Keep It Simple. The more systems you depend on, the more your service will suffer from them.
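A tiny Java sketch of the DNS idea, using the hypothetical storage-lb.servicename name from above; getAllByName returns every address the name currently resolves to, so adding a load balancer is purely a DNS change. (Bear in mind the JVM caches lookups, so the networkaddress.cache.ttl setting may need tuning before DNS changes are picked up promptly.)

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class StorageEndpoints {
    public static InetAddress[] lookup() throws UnknownHostException {
        // Every address behind the service name; the application never has to
        // know how many load balancers exist or when one is added.
        return InetAddress.getAllByName("storage-lb.servicename");
    }
}
```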
One approach to creating reliable software is crash-only software:
Crash-only software is software that crashes safely and recovers quickly. The only way to stop it is to crash it, and the only way to start it is to recover. A crash-only system is composed of crash-only components which communicate with retryable requests; faults are handled by crashing and restarting the faulty component and retrying any requests which have timed out. The resulting system is often more robust and reliable because crash recovery is a first-class citizen in the development process, rather than an afterthought, and you no longer need the extra code (and associated interfaces and bugs) for explicit shutdown. All software ought to be able to crash safely and recover quickly, but crash-only software must have these qualities, or their lack becomes quickly evident.
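As an illustration only, here's a rough Java sketch of that loop with a hypothetical Component interface: the only fault handling is to crash (close) the component, recover it by constructing a fresh one, and retry the timed-out request.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical crash-only worker: the only way to stop it is close() ("crash"),
// and the only way to start it is to construct a fresh one ("recover").
interface Component {
    String handle(String request);   // requests must be retryable
    void close();
}

class Supervisor {
    private final Supplier<Component> factory;   // recovery = start from scratch
    private Component current;

    Supervisor(Supplier<Component> factory) {
        this.factory = factory;
        this.current = factory.get();
    }

    String call(String request, long timeoutMillis, int maxRetries) throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            for (int attempt = 0; attempt <= maxRetries; attempt++) {
                Future<String> f = pool.submit(() -> current.handle(request));
                try {
                    return f.get(timeoutMillis, TimeUnit.MILLISECONDS);
                } catch (TimeoutException | ExecutionException fault) {
                    f.cancel(true);
                    current.close();          // crash the faulty component...
                    current = factory.get();  // ...recover by restarting it, then retry
                }
            }
            throw new IllegalStateException("request still failing after " + maxRetries + " retries");
        } finally {
            pool.shutdownNow();
        }
    }
}
```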
As I understand it, you're looking for specific patterns to use in Java applications that are part of an HA architecture. Of course there are numerous patterns and best practices that can be used, but these aren't really "HA patterns". Rather, they're good ideas that can be applied in many contexts.
I guess what I'm trying to say is this: A high availability architecture is composed of numerous small parts. If we pick one of these small parts and examine it, we'll probably find that there are no magical HA attributes to this small component. If we examine all the other components we'll find the same thing. It's when they're combined in an intelligent manner that they become an HA application.
An HA application is an application where you plan for the worst from the beginning. If you ever think in terms of "this component is so stable that we don't need additional redundancy for it", it's probably not an HA architecture. After all, it's easy to handle the problem scenarios that you foresee; it's the one that surprises you that brings down the system.
Despite all this, there are patterns that are especially useful in HA contexts. Many of them are documented in the classic book "Patterns of Enterprise Application Architecture" by Martin Fowler.
I assume you are writing a server-type application (let's leave web apps aside for a while - there are some good off-the-shelf solutions that can help there - and look at the "I've got this great new type of server to write, but I want it to be HA" problem).
In a server implementation, the requests from clients are usually (in some form or another) converted to events or commands (some form of the command pattern), and are then executed from one or more queues.
So, first problem: you need to store events/commands in a manner that will survive in the cluster (i.e. when a new node takes over as master, it looks at the next command that needs executing and begins).
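A sketch of what that might look like, with a hypothetical ReplicatedCommandLog standing in for whatever durable, cluster-replicated store you use; the key point is that the command is recorded before it runs, so the next master can resume from it.

```java
import java.io.Serializable;
import java.util.Map;

interface Command extends Serializable {
    void execute();
}

// Hypothetical cluster-wide, durable store; append() must not return until the
// command would survive the loss of this node.
interface ReplicatedCommandLog {
    long append(Command c);
    void markDone(long id);
    Map.Entry<Long, Command> nextPending();   // null when nothing is pending
}

class CommandProcessor {
    private final ReplicatedCommandLog log;

    CommandProcessor(ReplicatedCommandLog log) { this.log = log; }

    void submit(Command c) {
        long id = log.append(c);   // record it durably *before* executing it
        run(id, c);
    }

    // What a node does when it takes over as master after a failover.
    void recoverAsNewMaster() {
        Map.Entry<Long, Command> pending;
        while ((pending = log.nextPending()) != null) {
            run(pending.getKey(), pending.getValue());
        }
    }

    private void run(long id, Command c) {
        c.execute();               // assumes commands are idempotent / safe to re-run
        log.markDone(id);
    }
}
```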
Let's start with a single-threaded server implementation (the easiest - the concepts still apply to multi-threaded, but that brings its own set of issues). When a command is being processed, you need some sort of transaction processing.
Another concern is managing side effects, and how you handle failure of the current command. Where possible, handle side effects in a transactional manner, so that they are all or nothing. I.e. if the command changes state variables but crashes halfway through execution, being able to return to the "previous" state is great. This allows the new master node to resume the crashed command by simply re-running it. A good approach is to break the side effects into smaller tasks that can again be run on any node, i.e. store the main request's start and end tasks, with lots of little tasks that each handle only one side effect.
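One way to sketch that task breakdown (the CompletedTaskLog is hypothetical and would itself have to be replicated): each small task records its completion, so a new master re-running the command skips the side effects that already happened.

```java
import java.util.List;

interface Task {
    String id();
    void run();
}

// Hypothetical cluster-replicated record of which tasks already ran.
interface CompletedTaskLog {
    boolean isDone(String taskId);
    void markDone(String taskId);
}

class ResumableCommand {
    private final List<Task> tasks;           // start task, one-side-effect tasks, end task
    private final CompletedTaskLog log;

    ResumableCommand(List<Task> tasks, CompletedTaskLog log) {
        this.tasks = tasks;
        this.log = log;
    }

    void execute() {
        for (Task t : tasks) {
            if (log.isDone(t.id())) continue; // already ran before the crash
            t.run();
            log.markDone(t.id());
        }
    }
}
```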
This also introduces other issues which will affect your design. Those state variables are not necessarily database updates. They could be shared state (say a finite state machine for an internal component) that also needs to be distributed in the cluster. So you need a pattern for managing changes such that the master code always sees a consistent version of the state it needs, and then commits that state across the cluster. Using some form of immutable (at least from the perspective of the master thread doing the update) data storage is useful: all updates are effectively done on new copies that must go through some sort of mediator or facade, which only updates the local in-memory copies after the update has been applied across the cluster (or across the minimum number of members required for data consistency).
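A small sketch of that mediator/facade, with a hypothetical ClusterReplicator standing in for the replication layer: updates are built as new immutable copies, committed across the cluster (or quorum) first, and only then swapped into the local in-memory reference.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

// Hypothetical replication layer; commit() blocks until enough members have acknowledged.
interface ClusterReplicator<S> {
    void commit(S newState);
}

final class ReplicatedState<S> {
    private final AtomicReference<S> local;
    private final ClusterReplicator<S> replicator;

    ReplicatedState(S initial, ClusterReplicator<S> replicator) {
        this.local = new AtomicReference<>(initial);
        this.replicator = replicator;
    }

    S read() {
        return local.get();                   // the master always sees a consistent snapshot
    }

    void update(UnaryOperator<S> change) {
        S next = change.apply(local.get());   // work on a new copy, never in place
        replicator.commit(next);              // cluster (or quorum) first...
        local.set(next);                      // ...local in-memory copy last
    }
}
```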
Some of these issues are also present in master/worker systems.
You also need good error management, as the number of things that can go wrong on a state update increases (the network is now involved).
I use the state pattern a lot. Instead of one-line updates, for side effects you want to send requests/responses and use conversation-specific FSMs to track the progress.
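For illustration, here's a stripped-down, enum-based version of such a conversation FSM (the full state pattern would use state objects, and in a real system each transition would itself be persisted/replicated so the new master can see how far each conversation got):

```java
enum Phase { IDLE, REQUEST_SENT, RESPONSE_RECEIVED, FAILED }

class SideEffectConversation {
    private Phase phase = Phase.IDLE;

    void onSendRequest() {
        if (phase != Phase.IDLE) throw new IllegalStateException("already started: " + phase);
        phase = Phase.REQUEST_SENT;       // would be replicated in a real system
    }

    void onResponse() {
        if (phase != Phase.REQUEST_SENT) throw new IllegalStateException("unexpected response in " + phase);
        phase = Phase.RESPONSE_RECEIVED;
    }

    void onTimeout() {
        if (phase == Phase.REQUEST_SENT) phase = Phase.FAILED;   // candidate for retry
    }

    Phase phase() {
        return phase;
    }
}
```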
Another issue is the representation of endpoints: does a client connected to the master node need to be able to reconnect to the new master and then listen for results? Or do you simply cancel all pending results and let the clients resubmit? If you allow pending requests to be processed, you need a nice way to identify endpoints (clients), i.e. some sort of client id in a lookup.
You also need cleanup code, etc. (i.e. you don't want data for a client that never reconnects to wait around forever).
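A sketch of that endpoint lookup plus cleanup, with made-up names: pending results are keyed by client id so a reconnecting client can claim them from the new master, and a periodic sweep expires entries for clients that never come back.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class PendingResults {
    private static final class Entry {
        final Object result;
        final long createdAtMillis;
        Entry(Object result, long createdAtMillis) {
            this.result = result;
            this.createdAtMillis = createdAtMillis;
        }
    }

    private final Map<String, Entry> byClientId = new ConcurrentHashMap<>();
    private final long ttlMillis;

    PendingResults(long ttlMillis) { this.ttlMillis = ttlMillis; }

    void store(String clientId, Object result) {
        byClientId.put(clientId, new Entry(result, System.currentTimeMillis()));
    }

    Object collect(String clientId) {
        Entry e = byClientId.remove(clientId);   // the reconnected client claims its result
        return e == null ? null : e.result;
    }

    void sweep() {                               // run periodically: the cleanup code
        long now = System.currentTimeMillis();
        byClientId.values().removeIf(e -> now - e.createdAtMillis > ttlMillis);
    }
}
```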
Lots of queues are used. A lot of people will therefore use some message bus (JMS, say, for Java) to push events in a transactional manner.
Terracotta (again for Java) solves a lot of this for you - just update the memory - Terracotta is your facade/mediator here. It simply injects the aspects for you.
Terracotta (I don't work for them) introduces the concept of "super static", so you get cluster-wide singletons that are cool, but you need to be aware of how this will affect testing and the development workflow - i.e. use lots of composition instead of inheritance of concrete implementations for good reuse.
For web apps, a good app server can help with session variable replication, and a good load balancer works. In some ways, exposing this via REST (or your web service method of choice) is an easy way to write a multi-threaded service, but it will have performance implications. Again, it depends on your problem domain.
Message servers (say, JMS) are often used to introduce loose coupling between different services. With a decent message server you can do a lot of message routing (again, Apache Camel or similar does a great job), e.g. a sticky consumer against a cluster of JMS producers, which can also allow for good failover. JMS queues etc. can provide a simple way to distribute commands in the cluster, independent of master/slave. (Again, it depends on whether you are doing LOB work or writing a server/product from scratch.)
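For example, pushing a command onto a JMS queue in a transacted session might look roughly like this (how you obtain the ConnectionFactory - JNDI, a broker-specific class - and the queue name are assumptions):

```java
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.JMSException;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;

class CommandPublisher {
    private final ConnectionFactory factory;

    CommandPublisher(ConnectionFactory factory) { this.factory = factory; }

    void publish(String serializedCommand) throws JMSException {
        Connection connection = factory.createConnection();
        try {
            connection.start();
            // Transacted session: the send is not visible until commit().
            Session session = connection.createSession(true, Session.SESSION_TRANSACTED);
            Queue queue = session.createQueue("commands");     // hypothetical queue name
            MessageProducer producer = session.createProducer(queue);
            producer.send(session.createTextMessage(serializedCommand));
            session.commit();                                  // all or nothing
        } finally {
            connection.close();
        }
    }
}
```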
(If I get time later I will tidy this up and add some more detail.)
I'd recommend having a read of Release It! by Michael Nygard. He outlines a number of anti-patterns that impact production systems, and patterns to help prevent one errant component from taking the whole system down. The book covers three major areas: Stability, Capacity, and General Design (covering Networking, Security, Availability, and Administration).
My previous workplace was bitten (at one time or another) by pretty much every single failure scenario Nygard outlines (with loss of revenue for each resulting outage). Implementing some of the techniques and patterns he suggests resulted in significantly more stable and predictable systems (and yes, the book is a little Java centric, but the principles are applicable in many contexts).
Designing high availability (HA) systems is an active research and development area. If you look at ACM or IEEE, there are a ton of research papers on qualities of service (availability, reliability, scalability, etc.) and how to achieve them (loose coupling, adaptation, etc.). If you're looking more for practical applications, take a look at fault-tolerant systems and middleware built to provide clustering, grid, or cloud-like functionality.
Replication and load balancing (a.k.a. reverse proxy) are some of the most common patterns of achieving HA systems, and can often be done without making code changes to the underlying software assuming it is not too tightly coupled. Even a lot of the recent cloud offerings are achieved essentially through replication and load balancing, although they tend to build in elasticity to handle wide ranges of system demand.
Making software components stateless eases the burden of replication, as the state itself doesn't need to be replicated along with the software components. Statelessness is one of the major reasons that HTTP scales so well, but it often requires applications to add on their own state (e.g. sessions) which then needs to be replicated.
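As a trivial illustration (names made up): a handler with no per-client fields can run on any replica behind the load balancer, because there is simply nothing to replicate or to lose when a node dies.

```java
// Hypothetical value object and handler; the point is the absence of fields.
final class Quote {
    final String symbol;
    final double price;
    Quote(String symbol, double price) { this.symbol = symbol; this.price = price; }
}

final class QuoteHandler {
    // No instance state: any replica can answer any request, and there is no
    // session to replicate or to lose when this node fails.
    Quote handle(String symbol) {
        return new Quote(symbol, lookupPrice(symbol));
    }

    private double lookupPrice(String symbol) {
        return 42.0;   // placeholder for a lookup against shared storage
    }
}
```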
Therefore, it is easier to make loosely coupled systems highly available than tightly coupled ones. Since the reliability of the system's components determines the overall system reliability, components that are unreliable may need to be replaced (hardware failures, software bugs, etc.). Allowing for dynamic adaptation at runtime lets these failed components be replaced without affecting the availability of the overall system. Loose coupling is also a reason for the use of reliable messaging systems, where the sender and receiver do not have to be available at the same time, yet the system itself remains available.