fault-tolerance

How to robustly, but minimally, distribute items across a peer-to-peer system

If one has a peer-to-peer system that can be queried, one would like to: reduce the total number of queries across the network (by distributing "popular" items widely and "similar" items together); avoid excess storage at each node; assure good availability to even moderately rare items in the face of client downtime, hardware failure, ...

Fail fast finally clause in Java

Is there a way to detect, from within the finally clause, that an exception is in the process of being thrown? i.e.:

    try {
        // code that may or may not throw an exception
    } finally {
        SomeCleanupFunctionThatThrows();
        // if currently executing an exception, exit the program,
        // otherwise just let the exception thrown by the...

How can I simulate ext3 filesystem corruption?

I would like to simulate filesystem corruption for the purpose of testing how our embedded systems react to it and ultimately have them fail as gracefully as possible. We use different kinds of block device emulated flash storage for data which is modified often and unsuitable for storage in NAND/NOR. Since I have a pretty good idea of ...

How do I automatically re-establish a duplex channel if it gets faulted?

Hi, I'm developing a client/server application in .NET 3.5 using WCF. Basically, a long-running client service (on several machines) establishes a duplex connection to the server over a netTcpBinding. The server then uses the callback contract of the client to perform certain on-demand operations, to which the client responds in an asynch...

How does the HP (Tandem) NonStop compare with Linux clusters?

NonStop systems are known for their high availability and reliability, and for their higher price. How do Linux- or Unix-based clusters compare with them, in these respects and others? ...

Fault (radiation) tolerant soft core?

Hi everybody, I have a question: is there a certification or something similar that decides whether a soft core is fault tolerant or not? And another question: I've seen that LEON3-FT is radiation tolerant only when implemented on an Actel RTAX FPGA. Is that right? Excuse me, but I'm confused about it because some people speak about LEON3-FT (fault tolerant) ...

Fault tolerant software architecture

I'm looking for some good articles on fault-tolerant software architectures. Could I please have some recommendations? ...

What's up with the [OptionalField] Attribute?

As I understand it I have to adorn a new member in a newer version of my class with the [OptionalField] Attribute when I deserialize an older version of my class that lacks this newer member. However, the code below throws no exception while the InnerTranslator property was added after serializing the class. I check for the property to ...

Robust fault tolerant MySQL replication

Is there any way to get a fault tolerant MySQL replication? I am in an environment that has many networking issues. It appears that replication gets an error and just stops. I need it to continue to work and recover from these faults. There is some wrapper software that checks the state of replication and restarts it in the case of losin...

How do supervisor processes monitor processes? Can the same be done on the JVM?

Erlang fault tolerance (as I understand it) includes the use of supervisor processes to keep an eye on worker processes, so if a worker dies the supervisor can start up a new one. How does Erlang do this monitoring, especially in a distributed scenario? How can it be sure the process has really died? Does it do heart beats? Is someth...

What are the cases that cause WCF proxy to be faulted?

I want to know the cases in which a WCF proxy (generated by VS2008 or svcutil) becomes faulted (enters the Faulted state), so that I can create a new instance and avoid using the faulted one. Currently I am handling TimeoutException, FaultException, CommunicationObjectAbortedException: try { client.Method1(args)...

How is the Google App Engine infrastructure fault tolerant?

Hi everybody, I am currently implementing a web application on Google App Engine. So far I have spent a huge amount of time redesigning the database and the application to follow GAE requirements and best practices. My problem is this: how can I be sure that GAE is fault tolerant, or to what degree is it fault tolerant? I didn't ...

Fault Tolerant Computing Learning Resources.

Hi All, I am planning to take a course on “Fault Tolerant Computing”. Does anybody know of some good learning resources on this subject? Public domain books/tutorials would be very handy. Thanks ...

What Linux tools are available to monitor/configure deployed code?

I'm writing some telecommunications software, and must devise a way to monitor and configure the software after it has been deployed on a server. The company I work for currently has an in-house solution, but we're exploring other options. What tools are available that can do the following (preferably all in one package): 1) Deliver s...

catastrophic disasters due to software system failures

I know this is not a programming problem, but since it is related to computer systems I am posting this question. Can somebody tell me a good place to find information related to catastrophic disasters due to software system failures? For example, incidents like Therac-25. The Risks Digest is a good place, but information it prov...

Do I absolutely need a minimum of 3 nodes/servers for a Cassandra cluster or will 2 suffice?

Surely one can run a single node cluster but I'd like some level of fault-tolerance. At present I can afford to lease two servers (8GB RAM, private VLAN @1GigE) but not 3. My understanding is that 3 nodes is the minimum needed for a Cassandra cluster because there's no possible majority between 2 nodes, and a majority is required for r...

Resources about crash-safe and fault-tolerant programming

I like the LWN article "Crash-only software" and I would like to learn more about crash-safe and fault-tolerant programming. It is surprisingly hard to assure that the persistent state is consistent in fault situations. Here I do not even talk about distributed operations: That is hard on a single node, too: Even the normal Berkeley DB ...

Articles about replication schemes/algorithms?

I'm designing a distributed system with a certain flow of data in it. I'd like to guarantee that at least N nodes have almost-current data at any given time. I do not need complete consistency, only eventual consistency (i.e., for any time instant, the current snapshot of data should eventually appear on at least N nodes). It is tricky to ...

Software Fault Tolerance

Hi All, Does anyone know how software fault tolerance is implemented in Air Traffic Control Systems? Some URLs would be very helpful. ...

Good scalable fault-tolerant in-memory database with LINQ support for .NET

Are there any good in-memory transactional databases that support LINQ and SQL Server persistence? I'd like to create a full representation of a large data store in memory and have it commit to a SQL Server database in a lazy fashion, but still keep some level of fault tolerance by scaling it out horizontally. I don't want to rely on n...