Have you dealt with space hardening?

tags:

c
memory

views:

1194

answers:

+38 Q:


I am very eager to study best practices when it comes to space hardening. For instance, I've read (though I can't find the article any longer) that some core parts of the Mars rovers did not use dynamic memory allocation; in fact, it was forbidden. I've also read that old-fashioned core memory may be preferable in space.

I was looking at some of the projects associated with the Google Lunar Challenge and wondering what it would feel like to get code on the moon, or even just into space. I know that space-hardened boards offer some sanity in such a harsh environment. However, I'm wondering (as a C programmer) how I would need to adjust my thinking and code if I were writing something that would run in space.

I think the next few years might show more growth in private space companies, so I'd really like to be at least somewhat knowledgeable regarding best practices.

Can anyone recommend some books, offer links to papers on the topic or (gasp) even a simulator that shows you what happens to a program if radiation, cold or heat bombards a board that sustained damage to its insulation? I think the goal is keeping humans inside the spacecraft (as far as fixing or swapping stuff goes) and avoiding missions to fix things.

Furthermore, if the board maintains some critical system, early warnings seem paramount.

How does one gain experience in this through testing and trial & error (barring the launch of your own personal satellite?)

+8 A:

I can't think of anything more important than providing a reliable low-level path to a) reset the device, and b) upload new code into the device.

Edmund 2009-02-23 07:26:52

It's making said path reliable that is fascinating me :)

Tim Post 2009-02-23 07:29:31

ISTR a problem with one probe whose failsafe command caused it to turn its antenna towards the sun, blinding it to the commands that were supposed to save it. I'm not sure if it was the same probe that was eventually recovered using a tiny backup antenna on the other end.

Edmund 2009-02-23 07:42:51

... and c) revert to the previous code, just in case...

Roger Lipscombe 2009-02-23 07:56:16

The JPL has used a Lisp REPL to great effect in the past.

Svante 2009-02-24 23:12:44

An interesting read: Lisping at JPL http://www.flownet.com/gat/jpl-lisp.html

J.F. Sebastian 2009-02-25 00:30:20

I recently read something about this in Wikipedia. There is some info about space/radiation hardening.

I think highly redundant, error-recoverable code is the key here, along with special low-level circuit and software design.

Hiperi0n 2009-02-23 13:07:43

+6 A:

Far more important than the programming language are the requirements on the underlying system (OS and Hardware). Basically, you need to ensure (and prove) deterministic and predictable behavior of the overall system. Much related research has been done in the real-time community. I strongly recommend reading two books if you really want to study this subject: Real-Time Systems by Jane Liu and a book with the same name by Hermann Kopetz. The former covers scheduling in a very theoretical fashion while the latter gets your feet back on the ground and pretty much covers all related aspects of (real-time) system design, e.g. fault tolerance.

Furthermore, the following two incidents nicely illustrate the quality of problems software engineers have to face when sending something into space:

Pankrat 2009-02-24 22:15:56

+1 A:

I believe the worse an environment is, the more error-correcting codes are used, and there are ECC memories that can be used to some extent.

If one can estimate the level of errors, one can construct an error-correcting code that can handle the introduced errors.
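As a small illustration of the idea, here is a sketch of the classic Hamming(7,4) code, which protects 4 data bits with 3 parity bits and corrects any single flipped bit per codeword. The function names `hamming74_encode` and `hamming74_decode` are made up for this example; real ECC memory does this in hardware and typically uses wider codes.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode a 4-bit value into a 7-bit Hamming(7,4) codeword.
 * Codeword positions are numbered 1..7 (position 1 is bit 0 of the result).
 * Positions 1, 2 and 4 hold parity; positions 3, 5, 6 and 7 hold data. */
uint8_t hamming74_encode(uint8_t nibble)
{
    uint8_t d0 = (uint8_t)((nibble >> 0) & 1u);
    uint8_t d1 = (uint8_t)((nibble >> 1) & 1u);
    uint8_t d2 = (uint8_t)((nibble >> 2) & 1u);
    uint8_t d3 = (uint8_t)((nibble >> 3) & 1u);

    uint8_t p1 = (uint8_t)(d0 ^ d1 ^ d3); /* parity over positions 3, 5, 7 */
    uint8_t p2 = (uint8_t)(d0 ^ d2 ^ d3); /* parity over positions 3, 6, 7 */
    uint8_t p4 = (uint8_t)(d1 ^ d2 ^ d3); /* parity over positions 5, 6, 7 */

    return (uint8_t)((p1 << 0) | (p2 << 1) | (d0 << 2) | (p4 << 3) |
                     (d1 << 4) | (d2 << 5) | (d3 << 6));
}

/* Decode a 7-bit codeword, correcting at most one flipped bit.
 * If 'corrected' is not NULL it is set to 1 when a bit was repaired. */
uint8_t hamming74_decode(uint8_t code, int *corrected)
{
    uint8_t b[8] = {0}; /* b[1]..b[7] mirror codeword positions 1..7 */
    for (int pos = 1; pos <= 7; pos++)
        b[pos] = (uint8_t)((code >> (pos - 1)) & 1u);

    /* The syndrome is the position of the flipped bit, or 0 if none. */
    unsigned syndrome = (unsigned)(b[1] ^ b[3] ^ b[5] ^ b[7])
                      | (unsigned)(b[2] ^ b[3] ^ b[6] ^ b[7]) << 1
                      | (unsigned)(b[4] ^ b[5] ^ b[6] ^ b[7]) << 2;

    if (syndrome != 0u)
        b[syndrome] ^= 1u;
    if (corrected != NULL)
        *corrected = (syndrome != 0u);

    return (uint8_t)(b[3] | (b[5] << 1) | (b[6] << 2) | (b[7] << 3));
}
```

This code only handles one error per 7-bit word; two flips in the same word would be "corrected" to the wrong value, which is why the expected error rate matters when choosing a code.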

epatel 2009-02-24 22:37:21

+33 A:

Space software is not arcane magic. You are still using 0's and 1's, not 1's and 3's. So there’s probably no wow factor involved in describing what goes into developing software.

Some slight differences that come to mind at the moment are:

Extremely process oriented.
Space software will always have both software and hardware watchdog timers.
Every space system I’ve worked on was a hard real-time system.
You simulate (to great accuracy) every external actor to the system. This usually involves building (sometimes really expensive) custom hardware that is used solely for testing.
You spend enormous effort and expense doing formal testing.
The customer (usually JPL) is extremely involved in the test process.
You generally are using old and known compilers and development environments, rather than the new ones.
Code reviews, code reviews and code reviews.
You better be very comfortable switching between the hardware and software worlds. You don’t have to know how to design the hardware but you have to know how it works.
Extensive use of test equipment, like oscilloscopes, logic analyzers, synthesizers and spectrum analyzers.
At least 3 locations for storing the application program. The default is burned in ROM. This will never change. The other 2 are for the current version and the next/last version.
Failure analysis (MTBF) is really important.
Redundant systems and failover plans for the critical components.
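The three-location scheme in the list above could be sketched roughly like this: at boot, try the current image, then the previous one, and fall back to the ROM image if neither passes its integrity check. Everything here (`struct image_slot`, `select_boot_slot`, the slot order, the use of CRC-32) is an assumption made for illustration, not a description of any particular flight system.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320); slow but table-free. */
uint32_t crc32_calc(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
    }
    return ~crc;
}

/* Hypothetical descriptor for one stored copy of the application. */
struct image_slot {
    const uint8_t *data;    /* start of the image */
    size_t         len;     /* image length in bytes */
    uint32_t       stored_crc; /* CRC recorded when the image was written */
};

enum { SLOT_ROM = 0, SLOT_CURRENT = 1, SLOT_PREVIOUS = 2, SLOT_COUNT = 3 };

/* Return the index of the slot to boot from. The ROM image is the last
 * resort and is returned without a check, since there is nothing left
 * to fall back to. */
int select_boot_slot(const struct image_slot slots[SLOT_COUNT])
{
    static const int order[] = { SLOT_CURRENT, SLOT_PREVIOUS };

    for (size_t i = 0; i < sizeof order / sizeof order[0]; i++) {
        const struct image_slot *s = &slots[order[i]];
        if (s->data != NULL && s->len > 0 &&
            crc32_calc(s->data, s->len) == s->stored_crc)
            return order[i];
    }
    return SLOT_ROM;
}
```

A real bootloader would also have to cope with a failed upload in progress and with an image that passes its CRC but crashes, which is where the watchdog timers come in.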

Dunk 2009-02-24 22:52:21

That's true up to now, but wait until the memristor arrives!

lsalamon 2009-04-30 14:14:22

+4 A:

Space-proof computing systems are all about dependability. A deep introduction to the field can be found in Fundamental concepts of dependability by Algirdas Avižienis, Jean-Claude Laprie & Brian Randell.

Real-time behavior is also a key concept for space computing. Like Pankrat, I would recommend Real-Time Systems by Hermann Kopetz.

To give a pragmatic sense of the challenges of space computing, think of:

Extreme conditions in space: very hot when oriented to the sun, very cold on the other side, many cosmic rays which may invert bits in memory, huge accelerations and vibrations when being launched, ... Hardware for space must be far more robust than hardware for the military.
When a failure occurs, nobody comes to replace the failed system, except on the International Space Station or for the Hubble Space Telescope. Everything must be fixed from the ground, through maximal observability and commandability and through spare systems to switch to. This is easy for Earth satellites. It is more difficult for space probes, for which communication delays may be an hour long. Everything must therefore be as reliable as possible in the first place.

mouviciel 2009-02-24 22:58:42

+15 A:

It can't be that hard, it isn't rocket science.

Oh wait...

Mark Ransom 2009-02-24 23:12:50

The hardest part is writing the special AttackByKlingonBattlecruiser exception handler.

Crashworks 2009-02-28 05:12:22

+3 A:

What I learned from the one project I was involved in as an intern:

Your hardware specs will change, usually for the worse!

For example, we were promised, promised, mind you, that the space-hardened CPU used in the design would run at 20 MHz.

The final result ran at 12 MHz. The senior programmer on the project spent a lot of time redesigning algorithms in order to meet the hard real time requirements of the control systems and much of the telemetry software ended up offloaded to a second system instead of running on the primary CPU.

So, try to leave some extra resources available in the original design and try not to use all the available CPU and memory.

Zan Lynx 2009-02-28 04:33:30

Yes, core memory is on the research boards.
Dynamic memory is not good for embedded systems. Reliability issues.
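One common way to live without `malloc` is a statically sized pool of fixed-size blocks: the memory is reserved at link time, allocation and release take constant time, and the pool cannot fragment. The sketch below is an illustration only; the names `pool_init`, `pool_alloc`, `pool_free` and the sizes are invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define POOL_BLOCKS 8   /* hypothetical: number of blocks in the pool */
#define BLOCK_SIZE  32  /* hypothetical: usable bytes per block */

/* A free block stores the link to the next free block in its own storage,
 * so the pool needs no bookkeeping memory beyond the blocks themselves. */
static union pool_block {
    union pool_block *next;
    uint8_t           payload[BLOCK_SIZE];
} pool[POOL_BLOCKS];

static union pool_block *free_list;

/* Thread every block onto the free list. Call once at start-up. */
void pool_init(void)
{
    free_list = NULL;
    for (size_t i = 0; i < POOL_BLOCKS; i++) {
        pool[i].next = free_list;
        free_list = &pool[i];
    }
}

/* Return one block, or NULL when the pool is exhausted. */
void *pool_alloc(void)
{
    union pool_block *b = free_list;
    if (b != NULL)
        free_list = b->next;
    return b;
}

/* Give a block obtained from pool_alloc() back to the pool. */
void pool_free(void *p)
{
    union pool_block *b = p;
    if (b == NULL)
        return;
    b->next = free_list;
    free_list = b;
}
```

Because the worst case is known up front (at most `POOL_BLOCKS` live blocks), exhaustion becomes a design-time question rather than a surprise in flight. This sketch is not interrupt-safe or thread-safe as written.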

I would guess that software ECC of data, plus using information theory and a custom loader to spread the data around the system to manage memory failures, would be a start. But I don't study rad-hard software, so I'm not familiar with it; that's just a guess.

Paul Nathan 2009-02-28 05:12:12

+2 A:

For a software perspective, write a privileged task that occasionally, randomly, flips bits in your code, and see how it deals with that. That's your simulator.
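A minimal sketch of such an injector is shown below. It flips bits in a data buffer rather than in code, since patching a running program's instructions is platform-specific; that substitution, the function names `inject_seed` and `inject_bit_flip`, and the xorshift generator are all choices made for this example. A seeded generator is used so that a failing run can be reproduced.

```c
#include <stddef.h>
#include <stdint.h>

/* Small deterministic PRNG (xorshift32) so fault campaigns are repeatable. */
static uint32_t rng_state = 0x12345678u;

static uint32_t xorshift32(void)
{
    uint32_t x = rng_state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    rng_state = x;
    return x;
}

/* Seed the generator; zero is not a valid xorshift state, so map it to 1. */
void inject_seed(uint32_t seed)
{
    rng_state = (seed != 0u) ? seed : 1u;
}

/* Flip one pseudo-randomly chosen bit inside region[0..len-1] and return
 * the index of the flipped bit. The caller must pass len > 0, and len * 8
 * must fit in 32 bits. */
size_t inject_bit_flip(uint8_t *region, size_t len)
{
    size_t bit = (size_t)(xorshift32() % (uint32_t)(len * 8u));
    region[bit / 8u] ^= (uint8_t)(1u << (bit % 8u));
    return bit;
}
```

In a test harness, a task would call `inject_bit_flip` on the memory of the code under test at random intervals and log the returned index alongside whatever the system did next.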

Hardware-wise, the parts will be old, because it takes a long time to get something to be space-rated. Also, new parts are continually shrinking in size, and the smaller a feature is (think memory cell on an I.C.) the more susceptible it is to corruption from a radiation event.

gbarry 2009-02-28 07:03:10

+11 A:

To get tough environment reliability specifically in C, here are some really concrete things that I've seen done.

MISRA-C : The automotive C subset. A bit like Ravenscar Ada/Java.

watchdogs : ensure the program doesn't lock up

ecc memory (sometimes)

checksums: looking for flipping bits. I've seen all three of these in one system:

1) checksum the program continuously (it was in EPROM but still got flipped bits).

2) checksum certain data structures periodically.

3) CPU sanity checks periodically.

4) check IO registers have what they are supposed to have in them.

4b) read back outputs onto independent inputs and verify.
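Item 2 in the list above (periodically checksumming a data structure) might look something like the following sketch. Fletcher-16 is used here only because it is short; the answer does not say which checksum was used, and `struct guarded_config`, `config_seal` and `config_is_intact` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Fletcher-16 checksum: two running sums, each reduced modulo 255. */
uint16_t fletcher16(const uint8_t *data, size_t len)
{
    uint16_t sum1 = 0, sum2 = 0;
    for (size_t i = 0; i < len; i++) {
        sum1 = (uint16_t)((sum1 + data[i]) % 255u);
        sum2 = (uint16_t)((sum2 + sum1) % 255u);
    }
    return (uint16_t)((sum2 << 8) | sum1);
}

/* Hypothetical protected structure: payload bytes plus their checksum. */
struct guarded_config {
    uint8_t  bytes[12];
    uint16_t check;
};

/* Recompute the checksum after every legitimate update. */
void config_seal(struct guarded_config *c)
{
    c->check = fletcher16(c->bytes, sizeof c->bytes);
}

/* Periodic check: returns nonzero if the payload still matches. */
int config_is_intact(const struct guarded_config *c)
{
    return fletcher16(c->bytes, sizeof c->bytes) == c->check;
}
```

A background task would call `config_is_intact` on each protected structure in turn and trigger the planned failure response (reload from a backup copy, reset, report) when it returns zero.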

Tim Williscroft 2009-03-04 04:24:16

And, have all the failure responses thoroughly planned, on the conviction that they will be needed.

Mike Dunlavey 2009-03-25 22:18:14

Failure responses are best put in the code. The error occurs at a time of its choosing. The code needs to report faults, especially when they are recovered from. The machine has to cope by itself, up to the point where the "computer fail" annunciator goes off.

Tim Williscroft 2009-03-25 23:54:34

+8 A:

I just stumbled into your interesting question.

I was at Instrumentation Lab during Apollo, and again later when it was called Draper Lab during the "cold war".

For the Apollo guidance computer, core was used for RAM, and a special braided core was used for ROM. The machine itself was made entirely out of NOR gates and was clocked quite slowly, for reliability.

I didn't work directly on Minuteman missiles, but I was aware of some of the issues. If a nuclear warhead goes off in the vicinity of some electronics, it basically shorts it out. The guidance computer had a radiation sensor that would instantly shut off Vc so nothing would burn out. Then the computer would restart, having had its registers erased.

To handle this, the computer would periodically snapshot its registers into core, and when it restarted it would start up from that checkpoint. To make this work, the software (all in ASM) had to be analyzed to see that it could take any number of such hits, at any frequency, without getting wrong answers. That was called being "restart protected". Very interesting problem, given that (thank goodness) it never had to be used.

Mike Dunlavey 2009-03-25 20:29:48

+1 A:

I worked on a safety critical device and we had to go through some similar hoops.

We had safety critical variables. There was a copy of the inverse of the variable. After each loop, the variable was checked against its inverse.
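In C, the inverse-copy technique could be sketched as follows. The type and function names (`struct safe_u32`, `safe_u32_write`, `safe_u32_read`) are invented for this example, and what to do on a mismatch is left to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* A value stored together with its bitwise complement. 'volatile' keeps
 * the compiler from assuming the two copies still agree. */
struct safe_u32 {
    volatile uint32_t value;
    volatile uint32_t inverse;
};

/* Store x and its complement together. */
void safe_u32_write(struct safe_u32 *v, uint32_t x)
{
    v->value = x;
    v->inverse = ~x;
}

/* Read the value back. Returns false (and leaves *out untouched) if the
 * two copies are no longer exact complements of each other. */
bool safe_u32_read(const struct safe_u32 *v, uint32_t *out)
{
    uint32_t a = v->value;
    uint32_t b = v->inverse;

    if ((a ^ b) != 0xFFFFFFFFu)
        return false;
    *out = a;
    return true;
}
```

Storing the complement rather than a plain second copy also catches faults that force both words to the same pattern, such as a region of memory stuck at all zeros.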

We had a walking ones and zeros test of ALL registers. That included the Program Counter!
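CPU registers cannot be addressed from portable C, so a real register test would be written in assembly for the specific part. As a stand-in, here is a sketch of the same walking ones and zeros pattern applied to a single 32-bit memory word; `walking_bits_test` is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Walk a single 1 and then a single 0 through every bit of *cell,
 * checking that each pattern reads back unchanged. The original
 * contents are restored before returning. */
bool walking_bits_test(volatile uint32_t *cell)
{
    uint32_t saved = *cell;
    bool ok = true;

    for (unsigned bit = 0; bit < 32u; bit++) {
        uint32_t one = (uint32_t)1u << bit;

        *cell = one;                    /* walking one */
        if (*cell != one)
            ok = false;

        *cell = (uint32_t)~one;         /* walking zero */
        if (*cell != (uint32_t)~one)
            ok = false;
    }

    *cell = saved;
    return ok;
}
```

This pattern detects bits that are stuck high or low. It does not check for coupling between different words, and on a live system the test would need to run with interrupts masked so that nothing else touches the word mid-test.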

We had a test of all the opcodes of the micro instruction set. We had to be sure that if you added 2 registers, the registers were added.

Some of this is probably not related to programs in space, but it gives you a sense of the magnitude of checking that is possible.

Robert 2009-04-30 13:58:47
