views: 349

answers: 5

On Toyota manufacturing lines they always know what path a part has traveled, just so they can be sure they can fix it if something goes wrong. Is this applicable to software too?

All error messages should tell me exactly what path they traveled. Some do: the error messages that include a stack trace. Is this a correct interpretation? Could it be used somewhere else?
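As a rough illustration of what I mean (just a toy Python sketch, not from the podcast), the stack trace lists every frame the error passed through:

    import traceback

    def load_config(path):
        # Fails deep inside the call chain.
        raise FileNotFoundError("missing config: " + path)

    def start_service():
        load_config("settings.ini")

    try:
        start_service()
    except FileNotFoundError:
        # The printed trace shows the path: <module> -> start_service -> load_config.
        traceback.print_exc()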

OK, here is the podcast. I think it is interesting:

http://itc.conversationsnetwork.org/shows/detail3798.html

A: 

This is a good approach, but be aware that you shouldn't overdo logging. Otherwise you won't be able to find the interesting information in all the noise, and it reduces overall performance (e.g. anonymous object creation, depending on the language).
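For instance, with Python's standard logging module (a sketch of the general idea, not tied to any particular codebase), you can defer expensive work so it only happens when the message will actually be emitted:

    import logging

    logger = logging.getLogger("orders")

    def expensive_summary(order):
        # Hypothetical helper that walks a large object graph.
        return repr(order)

    def process(order):
        # Formatting is deferred: the arguments are only rendered
        # if DEBUG is actually enabled.
        logger.debug("processing order %s", order)

        # Guard genuinely expensive diagnostics explicitly.
        if logger.isEnabledFor(logging.DEBUG):
            logger.debug("full order state: %s", expensive_summary(order))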

akr
+2  A: 

It's less vital with software. If something goes wrong in software, you can usually reproduce the fault and analyse it in captivity. Even if it only happens 1 time in 1000, you can often switch on all the logging and run it 1000 times (a simple soak test).

That's much more expensive and time-consuming on a manufacturing line, to the point of being impossible.

Having as much information available as possible the first time it goes wrong is no bad thing, but it's not as important to me as it is to Toyota.
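As a rough sketch of that kind of soak test (illustrative Python; run_once stands in for the real system under test):

    import logging
    import random

    logging.basicConfig(level=logging.DEBUG, filename="soak.log")

    def run_once(seed):
        # Stand-in for the real system; driven entirely by the seed.
        rng = random.Random(seed)
        return sum(rng.randint(0, 9) for _ in range(100))

    def soak(runs=1000):
        failures = []
        for i in range(runs):
            seed = random.randrange(2**32)
            logging.debug("run %d, seed %d", i, seed)
            try:
                run_once(seed)
            except Exception as exc:
                # Keep the seed so the rare failure can be replayed later.
                failures.append((seed, exc))
        return failures

    if __name__ == "__main__":
        print(soak())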

Steve Jessop
A: 

Producing error messages with a full stack trace is usually bad security practice.
On the other hand, and more in line with Toyota's intent, every developed module should be traceable back to the original programmer(s), and they should be held accountable for shoddy work, bug fixes, security vulnerabilities, etc. Not for disciplinary purposes, but for maintenance and, if necessary, education. And maybe for bonuses, in the contrary situation... ;-)
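On the first point, one common pattern (sketched here in Python; the handler shape and names are illustrative) is to log the full trace internally and show the user only an opaque reference:

    import logging
    import uuid

    logger = logging.getLogger("app")

    def handle_request(do_work):
        try:
            return do_work()
        except Exception:
            # The full stack trace goes to the internal log only.
            incident = uuid.uuid4().hex[:8]
            logger.exception("unhandled error, incident %s", incident)
            # The user sees no internals, just a reference number.
            return "Something went wrong (incident %s). Please contact support." % incident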

AviD
Good way of looking at it - svn blame has been the shortcut to understanding many a piece of under-documented code. Assuming the culprit is still around.
Steve Jessop
+5  A: 

A good idea where practicable. Unfortunately, it is usually prohibitively difficult to keep track of the entire history of the state of the machine. You just can't tag each data structure with where you got it from, and the entire state of that object. You might be able to store just the external events and in that way reproduce where everything came from.

Some examples:

I did work on a project where it was practicable and it helped immensely. When we were getting close to shipping, and running out of bugs to fix, we would have our game play in "zero players mode", where the computer would repeatedly play itself all night long with all variations of characters and locales. If it asserted, it would display the random key that started the match. When we came to work in the morning we'd write the key down from our screen (there usually was one) and start it again using that key. Then we'd just watch it until the assert came up, and track it down. The important thing is that we could recreate all the original inputs that led to the error, and rerun it as many times as we wanted, even after recompiles (within limits... the number of fetches from the random number generator could not be changed, although we had a separate RNG for non-game stuff like visual fx). This only worked because each match started after a warm reboot and took only a very small amount of data as input.
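In sketch form, the record-the-key-and-replay loop looked something like this (illustrative Python, not our actual code; play_match stands in for the real match simulation):

    import random
    import sys

    def play_match(key):
        # Stand-in for a full match driven entirely by the key.
        rng = random.Random(key)
        score = 0
        for _ in range(1000):
            score += rng.choice([-1, 0, 1])
            # Placeholder for the game's real asserts; the key in the
            # message is what we wrote down in the morning.
            assert score > -900, "assert hit; replay with key %d" % key
        return score

    def zero_player_loop():
        while True:
            play_match(random.randrange(2**32))

    if __name__ == "__main__":
        if len(sys.argv) > 1:
            play_match(int(sys.argv[1]))  # replay a specific key
        else:
            zero_player_loop()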

I have heard that Bungie used a similar method to try to discover bad geometry in their Halo levels. They would set the dev kits running overnight in a special mode where the indestructible protagonist would move and jump randomly. In the morning they'd look and see if he got stuck in the geometry at some location where he couldn't get out. There may have been grenades involved, too.

On another project we actually logged all user interaction with a timestamp so we could replay it. That works great if you can do it, but most programs interact with a changing DB whose entire state might not be stored so easily.
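A minimal sketch of that kind of timestamped record/replay log (illustrative Python; the event names and handler shape are assumptions):

    import json
    import time

    def record(event_log, name, **data):
        # Append one timestamped user interaction as a JSON line.
        event_log.write(json.dumps({"t": time.time(), "event": name, "data": data}) + "\n")

    def replay(path, handlers):
        # Re-dispatch logged events in order; handlers maps event name -> function.
        with open(path) as f:
            for line in f:
                entry = json.loads(line)
                handlers[entry["event"]](**entry["data"])

    # Usage: record(open("session.log", "a"), "click", x=10, y=20)
    #        replay("session.log", {"click": on_click})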

Mark Santesson
Good point. I also used this "keep info around" approach for a processing tool, so that errors from input that caused the output to be corrupt or just fail late could be tracked (e.g. the line of the input file where the error supposedly is).
steffenj
Mark, I was reading this answer and I thought, "I've seen that done before." Then I saw your name and realized we had worked together.
Nosredna
A: 

Reminds me of this talk over at Google Video about "debugging backwards in time". I rarely (never) use debuggers (and could not stand the jocular speaker), so I skipped the talk. Perhaps it is interesting to you?

pi