What are good resources describing process, architecture, and design patterns for developing safety-critical systems?

A: 

Safety-critical systems aren't safe because of any one thing. They're safe because of the processes used, rather than the techniques used.

Safety should be at the forefront of every process, from design to documentation to QC to verification.

Visage
wow way to miss the point of the question
Nuno Furtado
+2  A: 

There is quite a lot of literature on this topic in print. Take a look at amazon.com for books with 'high availability' or 'safety critical' in the title. Also, much of the documentation on this subject lives in peer-reviewed literature. Computer Science journals such as Communications of the ACM are a good source of papers on the subject.

This Stack Overflow posting has quite a lot of fanout about tooling for safety-critical software. (Disclaimer: I wrote the accepted answer.)

ConcernedOfTunbridgeWells
+4  A: 

G'day,

But you've pretty much hit the nail on the head with what you've alluded to in your question.

Namely, that it isn't any single magic ingredient that makes software suitable for safety critical systems. It is quite a range of various techniques and processes.

A couple of major points are having:

  • Reviews and sign off of:
    • requirements,
    • design,
    • code,
    • test plans,
    • etc.
  • Code quality metrics that are clearly expressed at the beginning of coding covering such things as:
    • max. allowed cyclomatic complexity,
    • max. allowed distance between declaration and use of a variable,
    • max. allowed level of indentation,
    • etc.
  • Peer reviews of code.

As an example, I worked on a replacement display system for an existing system at a European en-route air traffic control centre.

The first fifteen months were spent gathering the requirements from the existing system. Coding took six months and final testing took another two months.

All requirements were fully traceable through the code and all documents had signoff and acceptance by all parties.
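To give a feel for what a traceability check can look like in practice, here is a small Python sketch. It assumes, purely for illustration, that requirement IDs appear in source comments as tags of the form `REQ-<number>`; the tag format, the function name, and the file names in the usage note are hypothetical and not taken from the project described above.

```python
import re

# Hypothetical tag format: requirement IDs written as "REQ-<number>" in comments.
REQ_TAG = re.compile(r"\bREQ-\d+\b")


def trace(requirements, sources):
    """Cross-reference requirement IDs against tagged source files.

    requirements: iterable of requirement IDs, e.g. ["REQ-1", "REQ-2"].
    sources: dict mapping file name to file contents.
    Returns (coverage, untraced, unknown) where coverage maps each ID to
    the files mentioning it, untraced lists IDs found in no file, and
    unknown lists tags in the code that match no known requirement.
    """
    coverage = {req: [] for req in requirements}
    unknown = set()
    for filename, text in sources.items():
        for tag in sorted(set(REQ_TAG.findall(text))):
            if tag in coverage:
                coverage[tag].append(filename)
            else:
                unknown.add(tag)
    untraced = [req for req, files in coverage.items() if not files]
    return coverage, untraced, sorted(unknown)
```

For example, calling `trace(["REQ-1", "REQ-2"], {"display.c": "/* REQ-1 */"})` would report `REQ-2` as untraced, which is the kind of gap a sign-off review is meant to catch.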

The system was intended to run in parallel with the existing system for three months but the ATC agency was so impressed with the stability and performance of the new system that they put it completely online after only one month. And, apart from being air traffic controllers, who are renowned for being very conservative, they were German air traffic controllers!

How reliable was it? Well it was supposed to be 99.995% available, i.e. only down for about 2 minutes per month, but they have now seen 99.9995% which translates as only down for about 13 seconds per month!
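The downtime figures follow directly from the availability percentages. A quick Python sketch of the arithmetic, assuming a 30-day month (the function name is just for this example):

```python
def downtime_seconds(availability_percent, period_days=30):
    """Return the allowed downtime in seconds for a given availability.

    Assumes a 30-day month by default; pass period_days to change that.
    """
    period_seconds = period_days * 24 * 60 * 60
    return period_seconds * (1 - availability_percent / 100)


if __name__ == "__main__":
    for availability in (99.995, 99.9995):
        seconds = downtime_seconds(availability)
        print(f"{availability}% -> {seconds:.2f} s per 30 days")
```

This gives roughly 130 seconds (a little over 2 minutes) for 99.995% and roughly 13 seconds for 99.9995%.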

BTW You might like to have a look at this question on Safety Critical Systems Development.

HTH

cheers,

Rob Wells
That is impressive!
Nelson Reis
Please look at standards such as DO-178B and you get all this.
Akshar Prabhu Desai
+5  A: 

Here are some online resources I have found recently (along with a list of some of the patterns mentioned in each):

Brandon E Taylor
+1  A: 

The NASA mission-critical software process works off four propositions:

  1. The product is only as good as the plan for the product.
  2. The best teamwork is a healthy rivalry.
  3. The database is the software base.
  4. Don't just fix the mistakes -- fix whatever permitted the mistake in the first place.

Consider these stats: the last three versions of the program, each 420,000 lines long, had just one error each. They must be doing something right.

There is a very good article explaining these propositions here:

"They Write the Right Stuff"

Obviously this cost a lot of money!

Pablojim
Yes, but... "September 30, 1999 (CNN) -- NASA lost a $125 million Mars orbiter because one engineering team used metric units while another used English units for a key spacecraft operation, according to a review finding released Thursday."
Nosredna
@Nosredna - but that wasn't the "on-board shuttle group" ;-) I realise there is probably a bit of marketing in the article but I think it is instructive to see how NASA structures its mission-critical software processes.
Pablojim
Well, I remember when the very first launch of the very first shuttle was delayed because of computer problems. Sure, NASA has systems, but it also has a history of notorious computer bugs.
Nosredna
+1  A: 

Although it is a somewhat overlooked aspect of systems and safety engineering, you might want to check out my site dealing with human-computer interaction design patterns for safety-related systems. The patterns I discuss there evolved out of my doctoral thesis.

intuio