views:

151

answers:

3

Has anyone created or seen a good fault diagnostic procedure for a web-based solution that an Operations team could use for diagnostics and support?

The solution is a C# system running on IIS, making use of things like workflow and WCF services. It's a service-based solution and also makes use of external and internal (client-authored) services.

The idea is to provide them with a fault diagnostic tree (something like a flow chart), accompanied by text that gives them more information and/or a checklist to run through.
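As a very rough sketch, the kind of tree I have in mind could be represented like this in C# (placeholder names only; the real content would come from the checklist itself, not from code):

    using System;

    // Rough sketch only: a diagnostic step with a yes/no check and two branches.
    // Question, Notes and the branch structure are placeholders, not the real system.
    public class DiagnosticNode
    {
        public string Question { get; set; }    // e.g. "Does the service health page respond?"
        public string Notes { get; set; }       // accompanying text / checklist for this step
        public DiagnosticNode Yes { get; set; } // next step if the check passes
        public DiagnosticNode No { get; set; }  // next step if the check fails
    }

    public static class DiagnosticWalker
    {
        // Walks the tree by asking the operator each question in turn;
        // the real guide would render this as a flow chart plus checklist pages.
        public static void Run(DiagnosticNode node)
        {
            while (node != null && (node.Yes != null || node.No != null))
            {
                Console.Write(node.Question + " (y/n) ");
                string answer = Console.ReadLine();
                node = (answer != null && answer.Trim().StartsWith("y", StringComparison.OrdinalIgnoreCase))
                    ? node.Yes
                    : node.No;
            }
            if (node != null)
                Console.WriteLine(node.Notes);
        }
    }

Whether it ends up as code, a flow chart, or printed pages matters less to me than having a standard way to build and maintain the tree.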

The entire system is absolutely mission-critical, so any faults must be resolved ASAP.

Are there any standardised methodologies for creating something like this? A Google search turns up a lot from the automotive industry, but all the software-related material requires me to subscribe to some arbitrary site or buy something I have never seen before.

Any help would be much appreciated!

<Update> Just to be clear... I need a guide for what to do in the event of a fault, outside of the application code itself. There will always be a point where the customer is asked to phone us when some strange exception occurs, but what process should they go through before doing that?

How detailed should this be? What is the starting point? (Right when the call comes in, or after the ops person has made sure that it is an actual fault requiring immediate attention?) Should I assume that the ops guide will be used by people with very little technical ability? What about authority? Can I assume they have access to restart services/databases and/or IIS, or should I not care and let them build an escalation process around this document?</Update>

+3  A: 

The industry standard for situations like this is to call the Developers at home as soon as the problem occurs. It won't take very long before the developers then produce the necessary tools. They might even start producing applications that are much easier to diagnose. I've even seen a case where a developer in this situation has actually started recommending that other developers do the exact same thing.

Believe me, this works!


Ok, it's later in the day now, I'm wider awake, and will try to make this brutally clear.

You've already failed. It's too late. You didn't design your application with manageability in mind, and now you're trying to write documentation to make up for that lack. You're going to have to guess what can go wrong, and guess what to say to Operations staff that you do not know, so that they will do what you guess will be the right thing. Good luck with the guessing part.

I don't believe you can succeed by guessing. The technique that I suggested somewhat tongue in cheek actually worked. In retrospect, one of the reasons it worked, is that it took guesswork out of the equation. We had actual Developers being called at 2 in the Morning with actual critical customer problems. This gave actual intelligent people incentive to solve, after the fact, problems which should have been solved before the fact, but which were not.

Earlier would have been better, but later worked out as well. It worked by substituting a dynamic process for a static one, by depending on humans to act in their best interest, and by permitting them to succeed by doing so. It still amazes me that I came up with a distributed test framework for testing "clusters" of nodes connected over TCP/IP, and collecting details about failures; all in Perl and shell scripts. I don't know that I could do that today, even with better tools, but I had a damned good incentive to get it done back then!

Since you don't already know the answer to your question (and unless someone proves me wrong and gives it to you), I propose that you put into place a process that will permit you to learn the answers. It might be necessary for you to "prime the pump" by producing a near-worthless document that perhaps does nothing more than tell Operations how to do things like collect log files when a problem occurs. But over time, if you engage with Operations when a problem happens (especially at 2 in the Morning), then you will learn what else should have been in the document; and you will have heavy incentive to put it there!
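Even a throwaway collector along these lines would be enough to prime the pump (the IIS log path, log names and retention window here are assumptions, not your actual deployment):

    // Hypothetical "prime the pump" collector: copies the last day of Application
    // event log entries and IIS logs into one folder Operations can send back.
    // The IIS log path is an assumption; adjust for the actual deployment.
    using System;
    using System.Diagnostics;
    using System.IO;

    class LogCollector
    {
        static void Main()
        {
            string dest = Path.Combine(Path.GetTempPath(),
                "fault-" + DateTime.Now.ToString("yyyyMMdd-HHmmss"));
            Directory.CreateDirectory(dest);

            // Last 24 hours of the Application event log.
            using (EventLog log = new EventLog("Application"))
            using (StreamWriter writer = new StreamWriter(Path.Combine(dest, "application-events.txt")))
            {
                foreach (EventLogEntry entry in log.Entries)
                {
                    if (entry.TimeGenerated > DateTime.Now.AddDays(-1))
                        writer.WriteLine("{0}\t{1}\t{2}\t{3}",
                            entry.TimeGenerated, entry.EntryType, entry.Source, entry.Message);
                }
            }

            // Today's IIS logs (default location for the first site; an assumption).
            string iisLogDir = @"C:\inetpub\logs\LogFiles\W3SVC1";
            if (Directory.Exists(iisLogDir))
            {
                foreach (string file in Directory.GetFiles(iisLogDir, "*.log"))
                {
                    if (File.GetLastWriteTime(file) > DateTime.Now.AddDays(-1))
                        File.Copy(file, Path.Combine(dest, Path.GetFileName(file)), true);
                }
            }

            Console.WriteLine("Logs collected in " + dest);
        }
    }

The point is not the tool itself; it's that every 2am call will tell you what the next version of it should collect.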

Besides, if you ask for time and a half to be on call for 2am problem calls, then your management will have great incentive to let you fix the problem - so they can stop paying you so much.

John Saunders
True.... I'll copy this text into the Ops Guide. It should make for a very concise one-page document. ;-)
Gineer
No, no, this has to be a _living_ document. The ops guide should have a link to a page with the developer's home and cell phone numbers, with instructions to call whenever there's a problem; especially if the developers are already asleep (or worse). Try to create a fluid system, where action (a bug) causes reaction (Developer awakened to fix the bug). Do this, and there will be a second-level reaction: Developers produce fewer bugs, in order to not be awakened to fix them.
John Saunders
Of course, I must admit that it was me in that situation before, and that it worked. We got that code so clean, that the customer no longer felt he needed to pay for a dedicated Engineering team. Immediately after that happened, I was laid off (with a nice package). As a result, I had the leisure time necessary to learn this new ".NET" thing Microsoft was working on, and the rest is history. I tell you, this works. ;-)
John Saunders
Stellar answer John!
Dan F
+1  A: 

Our helpdesk uses a wiki for this.

Jordan Stewart
That works well for software that will be consumed in-house. In this case, the software will be shipped to a client site and needs a prebuilt guide for fault diagnostics when they unwrap the box.
Gineer
+1  A: 

The main problem with well-defined diagnostics is that by the time you can write the problem and its solutions down unambiguously, you have also written the specification for a self-repair function. So what is then left for the Operations Team to do?

The whole point of a human Operations Team is to do things that computers can't do, e.g. (1) to use their brains and (2) to check if the hardware is still alive.
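To make that concrete: once a step is unambiguous enough to write down, say "if the WCF host service is stopped, start it and wait up to 30 seconds", it is already a program, something like this (the service name is just a placeholder):

    // The diagnostic step, written as the self-repair routine it already is.
    // "FooWcfHost" is a placeholder service name, not part of the real system.
    using System;
    using System.ServiceProcess;

    class SelfRepair
    {
        static void Main()
        {
            using (ServiceController service = new ServiceController("FooWcfHost"))
            {
                if (service.Status == ServiceControllerStatus.Stopped)
                {
                    service.Start();
                    service.WaitForStatus(ServiceControllerStatus.Running,
                                          TimeSpan.FromSeconds(30));
                }
                Console.WriteLine("Service status: " + service.Status);
            }
        }
    }

Anything the flow chart can express that precisely, the system could just do itself; what remains for Operations is the judgement the chart can't capture.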

MSalters
I think the point here is to provide insight into problems that aren't programming problems, e.g. setup issues, environmental problems, data problems, etc.
Jordan Stewart