What are good resources describing process, architecture, and design patterns for developing safety-critical systems?
Safety-critical systems aren't safe because of any one thing. They're safe because of the processes used, rather than any particular technique.
Safety should be at the forefront of every process, from design to documentation to QC to verification.
There is quite a lot of literature on this topic in print. Take a look at amazon.com for books with 'high availability' or 'safety critical' in the title. Much of the material on this subject also lives in peer-reviewed literature; computer science journals such as Communications of the ACM are a good source of papers.
This Stack Overflow posting links out to quite a lot of material on tooling for safety-critical software. (Disclaimer: I wrote the accepted answer.)
G'day,
You've pretty much hit the nail on the head with what you alluded to in your question.
Namely, that no single magic ingredient makes software suitable for safety-critical systems. It takes a whole range of techniques and processes.
A couple of major points are having:
- Reviews and sign off of:
  - requirements,
  - design,
  - code,
  - test plans,
  - etc.
- Code quality metrics that are clearly expressed at the beginning of coding covering such things as:
  - max. allowed cyclomatic complexity,
  - max. allowed distance between declaration and use of a variable,
  - max. allowed level of indentation,
  - etc.
- Peer reviews of code.
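Limits like the ones in the metrics list can be checked mechanically rather than by eye. Here is a rough sketch, assuming Python source and using only the standard `ast` module. The limits `MAX_COMPLEXITY` and `MAX_NESTING` and all function names are made up for illustration, and the complexity count is a simplified approximation of cyclomatic complexity, not what any particular certified tool computes.

```python
import ast

# Hypothetical project limits; real values would be agreed before coding starts.
MAX_COMPLEXITY = 10
MAX_NESTING = 4

# Node types counted as decision points (a simplification of McCabe's metric).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp, ast.BoolOp)
# Statement types that add one level of indentation to their bodies.
_NESTING_NODES = (ast.If, ast.For, ast.While, ast.With, ast.Try)


def approx_complexity(func):
    """Return 1 plus the number of branch nodes found inside the function."""
    return 1 + sum(isinstance(n, _BRANCH_NODES) for n in ast.walk(func))


def max_nesting(node, depth=0):
    """Return the deepest level of nested block statements below node.

    Note: an `elif` is parsed as an `if` nested in the `else` branch,
    so this sketch counts it as one extra level.
    """
    deepest = depth
    for child in ast.iter_child_nodes(node):
        child_depth = depth + 1 if isinstance(child, _NESTING_NODES) else depth
        deepest = max(deepest, max_nesting(child, child_depth))
    return deepest


def check_source(source):
    """Return one message per limit that a function in the source exceeds."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            complexity = approx_complexity(node)
            nesting = max_nesting(node)
            if complexity > MAX_COMPLEXITY:
                problems.append(f"{node.name}: complexity {complexity} > {MAX_COMPLEXITY}")
            if nesting > MAX_NESTING:
                problems.append(f"{node.name}: nesting {nesting} > {MAX_NESTING}")
    return problems
```

Calling `check_source` on a file's text returns an empty list when every function is within the limits, so it could be wired into a pre-review gate. The point is only that the metrics are agreed up front and enforced the same way every time.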
As an example, I worked on a replacement display system for an existing system at a European en-route air traffic control centre.
The first fifteen months were spent gathering the requirements from the existing system. Coding took six months and final testing took another two months.
All requirements were fully traceable through the code and all documents had signoff and acceptance by all parties.
The system was intended to run in parallel with the existing system for three months but the ATC agency was so impressed with the stability and performance of the new system that they put it completely online after only one month. And, apart from being air traffic controllers, who are renowned for being very conservative, they were German air traffic controllers!
How reliable was it? Well, it was supposed to be 99.995% available, i.e. down for only about 2 minutes per month, but they have now seen 99.9995%, which translates to only about 13 seconds of downtime per month!
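For anyone who wants to check the arithmetic behind those availability figures, here is a quick sketch. It assumes a 30-day month, and the function name is just for illustration.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumption: a 30-day month


def downtime_seconds_per_month(availability_percent):
    """Return the downtime in seconds per month implied by an availability figure."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * SECONDS_PER_MONTH


for target in (99.995, 99.9995):
    seconds = downtime_seconds_per_month(target)
    print(f"{target}% available: about {seconds:.0f} s ({seconds / 60:.1f} min) down per month")
```

That works out to roughly 130 seconds (a little over 2 minutes) for 99.995% and roughly 13 seconds for 99.9995%. Each extra nine cuts the permitted downtime by a factor of ten.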
BTW You might like to have a look at this question on Safety Critical Systems Development.
HTH
cheers,
Here are some online resources I have found recently (along with a list of some of the patterns mentioned in each):
- Design Pattern Representation for Safety-Critical Embedded Systems - Ashraf Armoush, Falk Salewski, Stefan Kowalewski
  - Safety Executive
  - Safety Kernel
  - Acceptance Voting
- Safety-Critical Systems Design - Bruce Douglass
  - Homogeneous Redundancy
  - Diverse Redundancy
  - Monitor-Actuator
  - Safety Executive
- Utilizing UML and patterns for safety critical systems - Kai Hansen and Ingolf Gullesen
  - Dual-Channel
  - Shadow Safety Diagnostic
  - Safe Communication Subsystem
- Architecture of safety-critical systems - David Kalinsky
  - Basic Shutdown Architecture
  - Monitor-Actuator Architecture
  - Dual-Channel Architecture
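To give a feel for what one of these patterns looks like, here is a toy sketch of the Monitor-Actuator idea as I understand it: a normal actuation channel does the control, and an independent monitor channel with its own sensor forces a safe state if things go out of bounds. It is not taken from any of the papers above. The heater, the class names, and the temperature limits are all invented for the example, and a real system would run the two channels on separate hardware.

```python
from dataclasses import dataclass


@dataclass
class Heater:
    """Hypothetical actuator: a heater that is either powered or shut down."""
    powered: bool = False
    shut_down: bool = False

    def set_power(self, on):
        # Commands are ignored once the monitor has forced the safe state.
        if not self.shut_down:
            self.powered = on

    def force_safe_state(self):
        self.powered = False
        self.shut_down = True


class ActuationChannel:
    """Normal control logic: heat whenever the reading is below the setpoint."""

    def __init__(self, heater, setpoint_c):
        self.heater = heater
        self.setpoint_c = setpoint_c

    def step(self, control_sensor_c):
        self.heater.set_power(control_sensor_c < self.setpoint_c)


class MonitorChannel:
    """Independent check that uses its own sensor reading and a hard limit."""

    def __init__(self, heater, limit_c):
        self.heater = heater
        self.limit_c = limit_c

    def step(self, monitor_sensor_c):
        if monitor_sensor_c > self.limit_c:
            self.heater.force_safe_state()


heater = Heater()
control = ActuationChannel(heater, setpoint_c=60.0)
monitor = MonitorChannel(heater, limit_c=90.0)

# Simulated fault: the control sensor is stuck at 20 C while the real temperature climbs.
for actual_c in (50.0, 70.0, 95.0):
    control.step(control_sensor_c=20.0)
    monitor.step(monitor_sensor_c=actual_c)
```

In this run the stuck control sensor keeps asking for heat, and the monitor shuts the heater down once its own reading passes the limit. The design point is that the monitor shares as little as possible with the channel it watches, so a single fault cannot take out both.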
The NASA mission-critical software process works off four propositions:
- The product is only as good as the plan for the product.
- The best teamwork is a healthy rivalry.
- The database is the software base.
- Don't just fix the mistakes; fix whatever permitted the mistake in the first place.
Consider these stats: the last three versions of the program, each 420,000 lines long, had just one error each. They must be doing something right.
There is a very good article explaining these propositions here:
Obviously this cost a lot of money!