fault-tolerance

How to determine reliability for TMR, pair-and-spare, hybrid, and time-redundancy fault-tolerance methods?

Can anyone help me calculate the reliability of these methods? I have a project, so it would also help to be able to compute the power cost (%) of the redundancy methods ...
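As a starting point for the TMR case only, here is a minimal sketch. It assumes three identical modules that fail independently, a perfect majority voter, and a constant failure rate, so that a single module has reliability R = exp(-λt). Under those assumptions the system works when at least two of the three modules work, which gives R_TMR = 3R² - 2R³. The function names and the example numbers are hypothetical, and pair-and-spare, hybrid, and time redundancy need their own models.

```python
import math


def module_reliability(failure_rate: float, hours: float) -> float:
    """Reliability of one module, assuming a constant failure rate (exponential model)."""
    return math.exp(-failure_rate * hours)


def tmr_reliability(r: float) -> float:
    """Reliability of a 2-out-of-3 system of identical, independent modules.

    Assumes a perfect voter: P(all three work) + P(exactly two work)
    = r**3 + 3 * r**2 * (1 - r) = 3 * r**2 - 2 * r**3.
    """
    return 3 * r**2 - 2 * r**3


if __name__ == "__main__":
    # Hypothetical numbers: failure rate of 1e-4 per hour over a 1000-hour mission.
    r = module_reliability(1e-4, 1000)
    print(f"single module: {r:.4f}, TMR: {tmr_reliability(r):.4f}")
```

Note that TMR only beats a single module when R is above 0.5; at R = 0.5 the two are equal, and below that the redundant system is less reliable.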

What alternatives do I have if I want a distributed multi-master database?

I will build a system in which I want to reduce single points of failure, and I need a database. Are there any (free) relational database systems that handle multi-master setups well (i.e. where it is easy to add and remove nodes), or is it better to go with a NoSQL database? As I understand it, a key-value store will handle this b...

Testing fault-tolerant code

I’m currently working on a server application where we have agreed to try to maintain a certain level of service. The level of service we want to guarantee is: if a request is accepted by the server and the server sends an acknowledgement to the client, we want to guarantee that the request will happen, even if the server crashes. As req...
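One common way to get this kind of guarantee is to write the request to durable storage before acknowledging it, then replay the journal after a restart. The sketch below illustrates that idea only; it is not the poster's design. The journal format (one JSON object per line) and the names `accept_request` and `pending_requests` are assumptions made for the example.

```python
import json
import os


def accept_request(journal_path: str, request: dict) -> str:
    """Append the request to a journal and force it to disk before acknowledging."""
    with open(journal_path, "a", encoding="utf-8") as journal:
        journal.write(json.dumps(request) + "\n")
        journal.flush()
        os.fsync(journal.fileno())  # ask the OS to flush its buffers to the device
    return "ACK"  # only returned once the request has been written out


def pending_requests(journal_path: str) -> list:
    """Read back every journaled request, e.g. on startup after a crash."""
    if not os.path.exists(journal_path):
        return []
    with open(journal_path, encoding="utf-8") as journal:
        return [json.loads(line) for line in journal if line.strip()]
```

A test can then simulate the crash by discarding all in-memory state after `accept_request` returns and checking that `pending_requests` still yields the request.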

Error monitoring/handling on webservers

Hi everybody, We have a web server that we're about to launch a number of applications onto. They will all share database and memcached servers, but each application has its own MySQL database, and all memcached keys are prefixed per application. Possible scenario: If a memcached server in our cluster goes boom, we want someone (opera...

Best practices for fault tolerance and reliability of scheduled tasks or services

I have been working on many applications that run as Windows services or scheduled tasks. Now I want to make sure that these applications are fault tolerant and reliable. For example, I have a service that runs every hour. If the service crashes while it's operating or running, I'd like the application to run again for the same peri...

Are Erlang/OTP messages reliable? Can messages be duplicated?

Long version: I'm new to Erlang and am considering using it for a scalable architecture. I've found many proponents of the platform touting its reliability and fault tolerance. However, I'm struggling to understand exactly how fault tolerance is achieved in this system, where messages are queued in transient memory. I understand that a ...

Exception handling in a real time, SQL-Server driven system

Hi, I have developed a report viewer in .NET Winforms (it just runs queries and displays results). This works against a reporting database. However, the above is a small subset of a much larger application, which gets data from another database. It looks like this: Monitored system has a change in state (e.g. latency increases) => Eve...

How do I get a fault-tolerant web service client?

Is there any framework that generates fault-tolerant web service clients? By that I mean I don't have to regenerate the classes because of minor changes. Any programming language would be fine as a source of inspiration. The following changes to the web service shouldn't require regenerating the client: New optional method parameters. Man...

Scala + Akka: How to develop a Multi-Machine Highly Available Cluster

We're developing a server system in Scala + Akka for a game that will serve clients in Android, iPhone, and Second Life. There are parts of this server that need to be highly available, running on multiple machines. If one of those servers dies (of, say, hardware failure), the system needs to keep running. I think I want the clients t...

Design patterns for transactional services with checkpoints and recovery

I have a multistep process where each step does some network IO (a web service call) and then persists some data. I want to design it in a fault-tolerant way so that if the service fails, either because of a system crash or because one of the steps fails, I am able to recover and restart from the last error-free step. Here is how I am thinking o...
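To illustrate restarting from the last error-free step, here is a minimal sketch. It assumes the steps are plain callables and that the checkpoint is a small JSON file recording how many steps have completed. The name `run_with_checkpoints` and the file layout are hypothetical, and this is not the approach the poster goes on to describe.

```python
import json
import os
from pathlib import Path


def run_with_checkpoints(steps, checkpoint_path):
    """Run steps in order, skipping those a previous run already completed."""
    path = Path(checkpoint_path)
    completed = json.loads(path.read_text())["completed"] if path.exists() else 0

    for index in range(completed, len(steps)):
        steps[index]()  # may raise; the checkpoint then still points at this step

        # Write the new checkpoint to a temporary file and rename it into place,
        # so a crash mid-write never leaves a half-written checkpoint behind.
        tmp = path.with_suffix(".tmp")
        tmp.write_text(json.dumps({"completed": index + 1}))
        os.replace(tmp, path)

    return len(steps)
```

If the process dies after a step finishes but before its checkpoint is written, that step runs again on recovery, so each step should be idempotent (or detect that its work was already done).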