views:

137

answers:

8

Hi

Apologies if this has already been covered or you think it really belongs on wiki.

I am a software developer at a company that manufactures microarray printing machines for the biosciences industry. I am primarily involved in interfacing with various bits of hardware (pneumatics, hydraulics, stepper motors, sensors etc) via GUI development in C++ to aspirate and print samples onto microarray slides.

On joining the company I noticed that whenever there was a hardware-related problem, the whole setup would freeze, with nobody any the wiser as to what the specific problem was: hardware, software, misuse etc. Since then I have improved things somewhat by introducing software timeouts and exception handling to better identify and deal with any hardware-related problems that arise, e.g. PLC commands not successfully completed, inappropriate FPGA response commands, and various other deadlock-type conditions. In addition, the software will now log a summary of the specific problem, inform the user and exit the thread gracefully. This software is not embedded; it just interfaces with the hardware over serial ports.
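
For readers wondering what the timeout-plus-exception approach might look like, here is a minimal sketch. It is not the actual code from this system: `HardwareTimeout`, `waitForCompletion` and the `isDone` callback are hypothetical names, and `isDone` stands in for whatever checks the serial port for a completion reply.

```cpp
#include <chrono>
#include <functional>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical exception type: records which command never completed.
class HardwareTimeout : public std::runtime_error {
public:
    explicit HardwareTimeout(const std::string& command)
        : std::runtime_error("Timeout waiting for: " + command),
          command_(command) {}
    const std::string& command() const { return command_; }
private:
    std::string command_;
};

// Polls isDone until it returns true, or throws once the timeout has elapsed.
// isDone is an assumed stand-in for the real "has the device replied?" check.
void waitForCompletion(const std::string& command,
                       const std::function<bool()>& isDone,
                       std::chrono::milliseconds timeout,
                       std::chrono::milliseconds pollInterval =
                           std::chrono::milliseconds(10)) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!isDone()) {
        if (std::chrono::steady_clock::now() >= deadline) {
            throw HardwareTimeout(command);  // caller logs, informs user, exits thread
        }
        std::this_thread::sleep_for(pollInterval);
    }
}
```

The worker thread would wrap each hardware call in a `try` block, catch `HardwareTimeout`, log `e.what()` and shut down cleanly instead of blocking forever.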

In spite of what has been achieved, non-software guys still do not fully appreciate that in these cases, the 'software' problem they are reporting to me is not really a software problem, rather the software is reporting a problem, but not causing it. Don't get me wrong, there is nothing I enjoy more than to come down on software bugs like a ton of bricks, and looking at ways of improving robustness in any way. I know the system well enough now that I almost have a sixth sense for these things.

No matter how many times I try to explain this, nothing really penetrates. They still report what are essentially hardware problems (which eventually get fixed) as software ones.

I would like to hear from any others that have endured similar finger-pointing experiences and what methods they used to deal with them.

UPDATE Some great responses here that pretty much sing from the same hymn sheet: be more descriptive. I guess identifying the command and bombing out cleanly when the hardware fails was the first stage, but was still not quite enough. The next stage will be to map what are to the layman fairly meaningless PLC commands to something more suggestive. "PLC Command M71 timeout" becomes "Failure to initialize syringe system. Check adequate vacuum reached" and so on...
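
A rough sketch of that mapping, assuming a simple lookup table keyed on the PLC command name. The function name `describePlcTimeout` is made up for illustration, and only the M71 entry comes from the example above; any other entries would have to be filled in from the real command set.

```cpp
#include <string>
#include <unordered_map>

// Returns a user-facing message for a PLC command that timed out.
// Unknown commands fall back to the raw technical message.
std::string describePlcTimeout(const std::string& command) {
    static const std::unordered_map<std::string, std::string> messages = {
        {"M71", "Failure to initialize syringe system. Check adequate vacuum reached"},
        // Further entries would be added as each recurring fault is identified.
    };
    const auto it = messages.find(command);
    if (it != messages.end()) {
        // Keep the raw command in the text so the log stays searchable.
        return it->second + " (PLC Command " + command + " timeout)";
    }
    return "PLC Command " + command + " timeout";
}
```

Keeping the original command name in brackets means the developer can still trace the fault, while the user reads the suggestive part first.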

+2  A: 

You could try labeling the error messages as "HARDWARE PROBLEM". Might get your point across.

Joeri Sebrechts
Simple, basic and brilliant. Nice one!
AndyUK
But be careful -- you of course don't want to pre-emptively label every error as hardware problem. Then you will lose credibility. I don't think Joeri was suggesting it, but I wanted to say it to be clear.
MJB
Yes, fair comment. It's just that in certain situations, such as when an axis fails to reach its zero position, this has ALWAYS been due to some defect in the hardware calibration phase. If this is not set up right, the software will always fail. From my observations, it's the same 5 or 6 specific problems cropping up repeatedly, and the labeling could be restricted to just these instances.
AndyUK
I didn't mean to sweep everything under the hardware problem rug. I've just had issues in the past with errors that were really due to external databases being incorrectly configured, and I only solved the never-ending stream of complaints by relabeling the error as "database error". If in doubt, you can call it "Unknown error, suspected hardware calibration failure".
Joeri Sebrechts
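
One way the labeling discussed in this thread could be sketched, assuming each known fault is tagged with a source when it is raised. `FaultSource` and `labelError` are hypothetical names; the point is only that the hardware label is applied to the handful of faults known to be hardware-related, and everything else is labeled honestly.

```cpp
#include <string>

// Where a fault is believed to originate. Only tag a fault as Hardware
// when experience shows it always is (e.g. an axis failing to zero).
enum class FaultSource { Hardware, Software, Unknown };

// Prefixes the detail text with an explicit label for the user.
std::string labelError(FaultSource source, const std::string& detail) {
    switch (source) {
        case FaultSource::Hardware: return "HARDWARE PROBLEM: " + detail;
        case FaultSource::Software: return "SOFTWARE PROBLEM: " + detail;
        case FaultSource::Unknown:  break;
    }
    return "UNKNOWN ERROR: " + detail;
}
```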
+1  A: 

There's no such thing as a non-software problem in a system. Software is the boss, and the boss cannot blame the tools for failure.

If the underlying hardware is malfunctioning, the software should report to the user exactly what went wrong with which component. If it doesn't, that is a software problem.

For example, a TCP disconnection means it has to reconnect. If it's an FPGA response, it should tell the user exactly what the inputs and outputs were, and who is to blame. If not, this is a software problem.
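
As an illustration only, an error type along these lines could carry the component, what was sent and what came back, so the report names the culprit. `HardwareFault` and its fields are hypothetical, not part of any real library.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical exception carrying the component plus the exact input and output.
class HardwareFault : public std::runtime_error {
public:
    HardwareFault(const std::string& component,
                  const std::string& sent,
                  const std::string& received)
        : std::runtime_error(format(component, sent, received)),
          component_(component) {}

    // The component held responsible, for routing the report.
    const std::string& component() const { return component_; }

private:
    static std::string format(const std::string& component,
                              const std::string& sent,
                              const std::string& received) {
        return component + " fault: sent '" + sent +
               "', received '" + received + "'";
    }
    std::string component_;
};
```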

Pavel Radzivilovsky
I agree! But this is why I have gotten the system to report which specific PLC/FPGA command it is having problems with...
AndyUK
+1  A: 

Perhaps when reporting the problem either as a message to the user or an entry in the log file you need to make it explicitly clear that it's the hardware that's at fault:

"Stepper motor not responding".

Unfortunately, because it's the software that people see and interact with they assume that the software is all that there is.

ChrisF
+1  A: 

I agree with the other posters, but I wanted to add another perspective: It could be worse. They could be attempting to solve the hardware problems for days or weeks, and then find out later, when everyone is under the gun and has been going crazy about it not getting fixed, that they were addressing the wrong problem and it was, in fact, a software problem. So count your blessings. If they always classify it as a software problem, at least you know about it. Only then can you troubleshoot, maybe put in additional problem-solving or problem-identifying code, and make the system a tiny bit better.

Also, this is pretty much the same as every software developer everywhere has ever faced. Except usually it is the software versus the user, not the software versus the hardware. And in that case, it appears there is no known solution. Lots of ways to address the problem, but no way to fix it. Thus the ever-growing list of acronyms describing how to blame the user without being rude: ID-ten-T error, PICNIC, PEBKAC, etc.

MJB
A: 

Test-oriented development (not necessarily 'test-driven') is what you should resort to.

Basically, every sub-system should have a reasonably thorough set of unit tests to identify problems before integration. Every time a problem occurs, test the hardware so you can know for sure (or almost for sure) that it is a hardware problem. This means the hardware must be designed in such a way that it can be thoroughly tested.
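
A small sketch of the "test the hardware when a problem occurs" idea, assuming each component exposes some check that returns true when it is healthy. `SelfTest` and `failingComponents` are made-up names, and the checks themselves would be whatever diagnostics the real hardware supports.

```cpp
#include <functional>
#include <string>
#include <vector>

// One diagnostic: a component name and a check that returns true if healthy.
struct SelfTest {
    std::string component;
    std::function<bool()> check;
};

// Runs every self-test and returns the names of the components that failed.
std::vector<std::string> failingComponents(const std::vector<SelfTest>& tests) {
    std::vector<std::string> failed;
    for (const auto& test : tests) {
        if (!test.check()) {
            failed.push_back(test.component);
        }
    }
    return failed;
}
```

Running this right after a fault gives a list of suspect components to attach to the error report.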

I was an integration head for my college robot team and this tactic helped a lot.

Hope this helps.

NawaMan
I don't think this is the problem OP was discussing. It sounds like you are addressing problems occurring in development, and OP was discussing problems occurring in production. So even if the tests work, and the code was correct, there can still be a failure when running in the real world.
MJB
+1  A: 

"If what you're doing isn't working, stop doing it and try something else"

As pointed out in other comments, it's a communication and, to a lesser extent, perception problem. People will blame what they don't understand FAR more easily, which lets them feel like the victim. A motor could be sparking, throwing fire and exploding because someone grossly overloaded a feeder (with EVERY warning not to do so plastered all over it) -- but if that software stops responding, guess what caused the problem?

Since giving every one of your users an EE and CS class or 10 is completely out of the question, fall back on good ole communication. The basis of which is 4 things (mostly my opinion), in no particular order: what you observe, what you feel, what you think and what should be done. I'll put this idea into practice in this response.

It seems like your users like to blame software when the underlying hardware is the key issue (observe). Trying to explain this to the users is impractical and a waste of time; that's not their job and most of them won't care (feel). What you may want to try is talking with the engineering team about the parts they're using and looking into things that work better with software in general. Maybe there are some constraints on the inputs that were never considered? (think) Changing out the hardware, or just understanding it better, might be the real answer, along with more targeted errors and feedback to those users (done).

jeriley
+1  A: 

Who is it who's reporting the problems?

If it's the end users, I think this is a non-issue. They just know that what they're trying to do is not working. It's not the user's responsibility to diagnose the problem. All they know is, "I tried to do X, Y should have happened, but instead Z happened." Everything beyond that is your problem.

If the hardware folks are insisting that the problem is in the software and the software folks are insisting that the problem is in the hardware, then you need to enhance the software to diagnose errors more precisely, as ChrisF and others have noted.

If the higher-ups are blaming the software group for problems that are the responsibility of the hardware group and you're sick of taking the blame for other people's mistakes, okay, I understand that. Again, as the software guy, you have the power to create more precise error messages. If you can explicitly say, "Stepper motor not responding" or whatever, then you have the "moral authority" to insist that someone run diagnostics on the stepper motor. Just saying, "I'm pretty sure it's a hardware problem" isn't going to win an argument.

Jay
A: 

First, make sure your users are more likely to read and understand your error messages. Displaying "FPGA command GS_WIDGIT_FROB returned invalid response 0xFF45001C. Shutting down controller id 576D. (Error 1Xf)" might be great for you. But the user is likely to hit "Ok" without reading it. Even if they do read it, it gives them no useful information. Either way, you're getting a phone call. Display "Widgit Frobber requires maintenance", but still log all the heavy details somewhere, and you're likely to get fewer calls.

Second, you know it's a hardware problem so do something about it! Have your software email hardware support, or whatever it takes to get the problem fixed. If the user is forced to decide what action to take to fix it, you can bet they'll get it wrong at least some of the time. If the user sees "Widgit Frobber requires maintenance. Hardware support has been notified (ticket #234)" they know that they don't have to do a thing.
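
To make the two points concrete, here is a minimal sketch that splits one fault into a short user message and a detailed log entry. `FaultReport` and `buildReport` are hypothetical, and the ticket number is assumed to come from whatever support system you have; nothing here actually sends an email or opens a ticket.

```cpp
#include <string>

// What the user sees versus what goes into the log file.
struct FaultReport {
    std::string userMessage;
    std::string logEntry;
};

// Builds both views of a hardware fault. ticketNumber is assumed to have
// been issued already by an external support system.
FaultReport buildReport(const std::string& componentName,
                        const std::string& technicalDetail,
                        int ticketNumber) {
    FaultReport report;
    report.userMessage = componentName +
        " requires maintenance. Hardware support has been notified (ticket #" +
        std::to_string(ticketNumber) + ")";
    // Keep every technical detail for the developer, tied to the same ticket.
    report.logEntry = "[ticket #" + std::to_string(ticketNumber) + "] " +
                      componentName + ": " + technicalDetail;
    return report;
}
```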

Joe
