Most bugs are fairly simple, easily reproducible, and easy to debug. What do you do when you run into ones that are hard or impossible to reproduce under a debugger, i.e. one of these?

Our app is multi-threaded, and it is further complicated by the fact that it communicates with multiple clients via remoting. Some bugs take weeks to track down, and sometimes we can't even be sure that a problem is fixed: because of its inconsistent nature, it could be just coincidence that the issue hasn't been seen for a while.

We already have an error reporting system, so if we're lucky and the bug throws an exception we'll get a stack trace. Even that's not always enough, because it's not obvious from the stack how, for instance, a certain value turned out to be null. This is especially true when an exception occurs in a worker thread (which is the case most of the time).

And then you have ones that don't even throw exceptions; the app just behaves unexpectedly, and only a small percentage of the time.

This is in .NET, so some of the memory/pointer work is hidden away, but we have many third-party components that aren't managed code and a fair amount of COM interop, so it still gets a little tricky.

Obviously there are no straightforward answers, since I'm not asking about a specific bug, but what are some general concepts, principles, and tactics for tackling these kinds of problems?

+1  A: 

Well, I think some of this should be a design consideration, what some might call an "Enterprise Concern": including half-decent logging/tracing and instrumentation will help immensely in debugging (especially with configurable verbosity!). Even throwing a couple of custom performance counters into the application can sometimes help debug race conditions, if you have to go to extremes.
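To make "configurable verbosity" concrete, here is a minimal sketch in Python (the same idea carries over to .NET logging frameworks). The environment variable `APP_LOG_LEVEL`, the logger name `remoting`, and the `handle_request` function are hypothetical names chosen for illustration, not anything from the original app.

```python
import logging
import os

# Verbosity names accepted from the (hypothetical) APP_LOG_LEVEL variable.
LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
}


def configure_logging(default_level="WARNING"):
    """Configure root logging from the environment and return the chosen level."""
    name = os.environ.get("APP_LOG_LEVEL", default_level).upper()
    level = LEVELS.get(name, logging.WARNING)  # unknown names fall back to WARNING
    logging.basicConfig(
        level=level,
        # The thread name is included because most failures occur in worker threads.
        format="%(asctime)s %(levelname)s [%(threadName)s] %(name)s: %(message)s",
        force=True,  # replace any earlier configuration (Python 3.8+)
    )
    return level


log = logging.getLogger("remoting")


def handle_request(client_id, payload):
    """Illustrative request handler with cheap debug-level tracing."""
    log.debug("request from %s: %r", client_id, payload)
    if payload is None:
        log.warning("client %s sent an empty payload", client_id)
        return None
    result = payload.upper()
    log.debug("reply to %s: %r", client_id, result)
    return result
```

In normal operation only warnings are written; setting `APP_LOG_LEVEL=DEBUG` before starting the process turns on the detailed trace without a rebuild.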

The second thing is more of an approach or mindset - try to rule out components and environmental concerns (perhaps one at a time), it'll help you narrow down the potential causes of exceptions and other issues.

Lastly, having a good test environment where you can try to reproduce the same conditions and errors is a big help at times, even if you have to simulate it rather than re-create a physical network of computers, for example.

RobS
A: 

I know of one programmer who leaves his unit tests in his production code and provides the ability to run them (via a switch). He logs all the failures and can then review them.

Obviously, this could be seen as a little controversial, but he says it's a big help in getting feedback on how the system operates in the "real world".
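A rough sketch of what such a switch might look like, in Python. The `--self-test` flag, the `parse_port` function, and the check names are all made up for illustration; the programmer's actual setup isn't described here.

```python
import logging
import sys

log = logging.getLogger("selftest")


def parse_port(text):
    """Stand-in for a piece of production code."""
    port = int(text)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port


def check_parse_port_accepts_valid():
    if parse_port("8080") != 8080:
        raise AssertionError("8080 should parse to 8080")


def check_parse_port_rejects_zero():
    try:
        parse_port("0")
    except ValueError:
        return
    raise AssertionError("port 0 should be rejected")


SELF_TESTS = [check_parse_port_accepts_valid, check_parse_port_rejects_zero]


def run_self_tests(tests=SELF_TESTS):
    """Run every check, log each failure with its traceback, return failed names."""
    failures = []
    for test in tests:
        try:
            test()
        except Exception:
            log.exception("self-test failed: %s", test.__name__)
            failures.append(test.__name__)
    return failures


# Hypothetical switch: run the checks only when explicitly requested.
if __name__ == "__main__" and "--self-test" in sys.argv:
    sys.exit(1 if run_self_tests() else 0)
```

The checks raise explicitly instead of using `assert`, so they still work when Python runs with optimizations that strip assertions.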

dommer
A: 

With most bugs you find them after the event, and you fix them by trying to re-create the circumstances. Simple bug = simple recreation.

The key to your issue is being able to recreate the circumstances. With a complex environment like this, I'm thinking the only way you can do this is to take every interface point that could fail and implement logging for that interface, i.e. dump to file and/or DB. Of course you wouldn't want this turned on all the time, but you have to code it in at the start. Then set up a test environment that can be driven from the log data. That way you can run and re-run the circumstances until you re-create the bug, and then you're 80% of the way to solving it.
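As a sketch of the record-and-replay idea, assuming the arguments and results crossing the interface can be serialized as JSON: the helper names `recorded` and `replay` and the `divide` example are invented for illustration.

```python
import io
import json


def record_call(log_file, name, args, result=None, error=None):
    """Append one interface call to the log as a JSON line."""
    entry = {"call": name, "args": args, "result": result, "error": error}
    log_file.write(json.dumps(entry) + "\n")


def recorded(name, func, log_file):
    """Wrap an interface function so every call and its outcome is logged."""
    def wrapper(*args):
        try:
            result = func(*args)
        except Exception as exc:
            record_call(log_file, name, list(args), error=repr(exc))
            raise
        record_call(log_file, name, list(args), result=result)
        return result
    return wrapper


def replay(log_lines, handlers):
    """Re-run recorded calls; return the entries whose outcome now differs."""
    mismatches = []
    for line in log_lines:
        entry = json.loads(line)
        func = handlers[entry["call"]]
        try:
            result, error = func(*entry["args"]), None
        except Exception as exc:
            result, error = None, repr(exc)
        if (result, error) != (entry["result"], entry["error"]):
            mismatches.append(entry)
    return mismatches


def divide(a, b):
    return a / b


# Record two calls (one failing), then replay them against the same code.
buf = io.StringIO()
logged_divide = recorded("divide", divide, buf)
logged_divide(6, 3)
try:
    logged_divide(1, 0)
except ZeroDivisionError:
    pass
lines = buf.getvalue().splitlines()
unchanged = replay(lines, {"divide": divide})  # empty list: outcomes match
```

Note that this replays inputs in order on a single thread; it does not reproduce the original timing or thread interleaving, so for race conditions it narrows down the inputs rather than guaranteeing a repro.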

MrTelly
A: 

In an application of the type you describe, my experience is that it's very easy for cohesion to slip away and coupling to creep in, and the problems show up most often at the boundaries between the pieces.

For me (given my incompletely developed guru skills), I find I need to start finding ways to simplify the code (so I can grok more of it at a time), improve the cohesiveness of the pieces (so the scope of investigation can be limited as much as possible), and do team code reviews of the threaded pieces.

And to reinforce the point, don't make any changes that don't simplify the design.

Come to think of it, a lot of the patterns for this are described in "Refactoring".

le dorfier
A: 

Systematic logging may help:

  • Log at interfaces so you can narrow it down to one component.

  • Log internal state changes when they are hard to deduce from what happens at the interfaces.

Sometimes a snapshot of the system at the moment of failure may also help, if feasible. On small embedded systems this may be a memory dump, in Java it might be the thread state, or you might implement dumping appropriate state on command.
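For the "thread state on command" variant, here is a minimal Python sketch. It relies on `sys._current_frames()`, which is specific to CPython; the `dump_threads` helper and the thread name `worker-1` are illustrative assumptions.

```python
import sys
import threading
import traceback


def dump_threads():
    """Return a text snapshot of the current stack of every live thread."""
    names = {t.ident: t.name for t in threading.enumerate()}
    parts = []
    # sys._current_frames() maps thread id -> topmost frame (CPython-specific).
    for ident, frame in sys._current_frames().items():
        parts.append(f"--- thread {names.get(ident, '?')} (id {ident}) ---")
        parts.append("".join(traceback.format_stack(frame)))
    return "\n".join(parts)


# Example: take a snapshot while a worker thread is blocked waiting.
stop = threading.Event()
worker = threading.Thread(target=stop.wait, name="worker-1")
worker.start()
snapshot = dump_threads()
stop.set()
worker.join()
```

In a real service this function would be hooked up to whatever "command" is convenient (an admin request, a signal, or the error reporter), so the snapshot is written alongside the stack trace of the failing thread.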

starblue