views: 600
answers: 18

Scenario

You've got several bug reports all showing the same problem. They're all cryptic, with similar tales of how the problem occurred. You follow the steps, but the problem doesn't reliably reproduce. After some investigation and web searching, you suspect what might be going on and you're pretty sure you can fix it.

Problem

Unfortunately, without a reliable way to reproduce the original problem, you can't verify that your change actually fixes the issue rather than having no effect at all, or even exacerbating or masking the real problem. You could simply not fix it until it becomes reliably reproducible, but it's a big bug, and not fixing it would cause your users a lot of other problems.

Question

How do you go about verifying your change?

I think this is a very familiar scenario to anyone who has engineered software, so I'm sure there are a plethora of approaches and best practices to tackling bugs like this. We are currently looking at one of these problems on our project where I have spent some time determining the issue but have been unable to confirm my suspicions. A colleague is soak-testing my fix in the hopes that "a day of running without a crash" equates to "it's fixed". However, I'd prefer a more reliable approach and I figured there's a wealth of experience here on SO.

+3  A: 

There is no one answer to this problem. Sometimes the solution you've found helps you figure out the scenario to reproduce the problem, in which case you can test that scenario before and after the fix. Sometimes, though, the solution you've found only fixes one of the problems but not all of them, or, as you say, masks a deeper problem. I wish I could say "do this, it works every time", but there isn't a "this" that fits that scenario.

Paul Tomblin
Yes, I wasn't expecting a panacea - if one existed, I think we'd all know it by now, right? In my particular scenario, I have a grasp of what may be happening, but getting the events to fall in the right order relies on Windows, making it hard to confirm that what I think is happening is right.
Jeff Yates
+3  A: 
MadKeithV
Yup, that's good. We did that. We added some trace statements around the problem area and confirmed the neighbourhood of the bug, but we couldn't confirm the exact call under suspicion because the added statements seemed to change the timing. Race conditions suck.
Jeff Yates
Most of the applications I've worked on have extensive logging facilities through 3rd-party logging libraries. It can be a real help to have detailed logs from client PCs for fixing those elusive "sometimes" bugs.
MadKeithV
Yes, we have crash reporting and trace reporting in our client tool, but our particular issue seems to be somewhat complex. The crash is in A but the issue starts in B and is caused by C - all in different modules. I even know what changes highlighted the problem, but the problem was always lurking.
Jeff Yates
This works in many cases. In some cases the instrumentation can cause the bug to move. That is generally indicative of a memory issue of some sort.
EvilTeach
+4  A: 

You'll never be able to verify the fix without identifying the root cause and coming up with a reliable way to reproduce the bug.

For identifying the root cause: If your platform allows it, hook some post-mortem debugging into the problem.

For example, on Windows, get your code to create a minidump file (core dump on Unix) when it encounters this problem. You can then get the customer (or WinQual, on Windows) to send you this file. This should give you more information about how your code's gone wrong on the production system.
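
As a rough illustration of the kind of hook described above, here is a minimal, hedged sketch for a .NET/Windows application (the CrashDumper class name and the crash.dmp path are invented for the example); it P/Invokes MiniDumpWriteDump from DbgHelp.dll out of an unhandled-exception handler:

using System;
using System.Diagnostics;
using System.IO;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

// Sketch: capture a minidump when an unhandled exception escapes the app.
// MiniDumpWriteDump lives in DbgHelp.dll; 0 is MINIDUMP_TYPE MiniDumpNormal.
static class CrashDumper
{
    [DllImport("DbgHelp.dll", SetLastError = true)]
    static extern bool MiniDumpWriteDump(
        IntPtr hProcess, uint processId, SafeFileHandle hFile, int dumpType,
        IntPtr exceptionParam, IntPtr userStreamParam, IntPtr callbackParam);

    public static void Install()
    {
        AppDomain.CurrentDomain.UnhandledException += (sender, args) =>
        {
            // "crash.dmp" is a placeholder path; the exception record is omitted
            // for brevity, so the dump carries thread/stack/module state only.
            using (var stream = new FileStream("crash.dmp", FileMode.Create))
            {
                Process self = Process.GetCurrentProcess();
                MiniDumpWriteDump(self.Handle, (uint)self.Id, stream.SafeFileHandle,
                                  0 /* MiniDumpNormal */, IntPtr.Zero, IntPtr.Zero, IntPtr.Zero);
            }
        };
    }
}

Calling CrashDumper.Install() at startup means the resulting .dmp file can later be opened in Visual Studio or WinDbg to see where the production system actually died.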

But without that, you'll still need to come up with a reliable way to reproduce the bug. Otherwise you'll never be able to verify that it's fixed.

Even with all of this information, you might end up fixing a bug that looks like, but isn't, the one that the customer is seeing.

Roger Lipscombe
I believe I've identified the root cause, but I have been unable to confirm it. So, assuming my root cause is correct, I want to be able to verify it (I know, big assumption, right?).
Jeff Yates
+6  A: 

Bugs that are hard to reproduce are the hardest ones to solve. What you need is to make sure that you have found the real root of the problem, even if the problem itself cannot be reproduced successfully.

The most common intermittent bugs are caused by race conditions. By eliminating the race, or by ensuring that one side always wins, you have eliminated the root of the problem even if you can't successfully confirm it by testing the results. The only thing you can test is that the cause does not repeat itself.

Sometimes fixing what is seen as the root cause does indeed solve a problem, just not the right one - there is no avoiding that. The best way to avoid intermittent bugs is to be careful and methodical with the system design and architecture.
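
To make the "one side always wins" idea concrete, here is a small, hedged C# sketch (SharedResource and its fields are invented names): two callers may race to tear down the same object, and Interlocked.Exchange guarantees that exactly one of them performs the dispose, so the race is removed rather than out-timed:

using System;
using System.Threading;

// Sketch: ensure exactly one "side" wins a teardown race.
// SharedResource is an invented type for illustration only.
class SharedResource : IDisposable
{
    private IDisposable _inner = new System.IO.MemoryStream(); // stands in for real state
    private int _disposed;  // 0 = live, 1 = disposed

    public void Dispose()
    {
        // Only the first caller to flip the flag performs the real cleanup;
        // any racing caller sees 1 and returns without touching _inner.
        if (Interlocked.Exchange(ref _disposed, 1) == 0)
        {
            _inner.Dispose();
            _inner = null;
        }
    }

    public bool IsDisposed { get { return _disposed != 0; } }
}

The same pattern - an atomic flag deciding the winner - applies wherever two code paths might both try to release or re-initialize shared state.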

Eran Galperin
I am certain it is a race condition, but I have struggled to confirm the root cause (though I am reasonably sure). The problem is that it seems a number of events are needed for the race to occur making it reasonably complex. Thus, confirming the suspected root cause has proved difficult.
Jeff Yates
This might mean the design could be improved so that those conditions can be isolated. With some refactoring you might be able to better confirm what the root cause is.
Eran Galperin
Refactoring can certainly help, but the main problem seems to be based on the order that Windows receives and processes messages while the user clicks things. We haven't identified what exact order that is. Fun fun fun.
Jeff Yates
I guess some inside knowledge of Windows events/timings would help. Maybe you could ask a separate question on that; somebody here might know something.
Eran Galperin
+1  A: 

Those types of bugs are very frustrating. Extrapolate them out to different machines with different types of custom hardware that might be in them (like at my company), and boy oh boy does it become a nightmare. I currently have several bugs like this at my job.

My rule of thumb: I don't fix it unless I can reproduce it myself or I'm presented with a log that clearly shows something wrong. Otherwise I cannot verify my change, nor can I verify that my change has not broken anything else. Of course, it's just a rule of thumb - I do make exceptions.

I think you're quite right to be concerned with your colleague's approach.

unforgiven3
Unfortunately, we can't release with this bug, so we need to come up with something. We've tried it on different machines and some do exhibit it more than others, which has helped.
Jeff Yates
A: 

These problems have always been caused by:

  1. Memory Problems
  2. Threading Problems

To solve the problem, you should:

  • Instrument your code (Add log statements)
  • Code Review threading
  • Code Review memory allocation / dereferencing

The code reviews will most likely only happen if it is a priority, or if you have a strong suspicion about which code is shared by the multiple bug reports. If it's a threading issue, then check your thread safety - make sure variables accessible by both threads are protected. If it's a memory issue, then check your allocations and dereferences, and especially be suspicious of code that allocates and returns memory, or code that uses memory allocated by someone else who may be releasing it.
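
As a minimal illustration of the "protect variables accessible by both threads" point, here is a hedged C# sketch (the Counter type is invented): every read and write of the shared field goes through the same lock, so the read-modify-write can no longer interleave:

// Sketch: protect state that two threads both touch.
// Counter is an invented example type.
class Counter
{
    private readonly object _sync = new object();
    private int _value;

    public void Increment()
    {
        lock (_sync)                 // both threads must take the same lock
        {
            _value = _value + 1;     // the read-modify-write can no longer be split by the other thread
        }
    }

    public int Read()
    {
        lock (_sync) { return _value; }
    }
}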

Kieveli
I don't believe it is threading. I'm almost certain that it's down to Windows messages (more specifically, .NET events) in my particular scenario. We did some instrumenting to try and confirm the root cause, but this was only partially successful as it changed the race.
Jeff Yates
There could be threads under the covers, sometimes the trickiest to find.
sammyo
+2  A: 

I use what I call "heavy-style defensive programming": add asserts in all the modules that seem linked to the problem. What I mean is, add A LOT of asserts: assert the obvious, assert the state of objects in all their members, assert the "environment" state, etc.

Asserts help you identify the code that is NOT linked to the problem.

Most of the time I find the origin of the problem just by writing the assertions, as it forces you to reread all the code and plunge into the guts of the application to understand it.
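
In a .NET codebase like the questioner's, that "assert everything" style might look roughly like this hedged sketch (the LabelPainter method, its parameters, and the invariants are invented for illustration):

using System.Diagnostics;
using System.Drawing;

// Sketch: "heavy" defensive asserts around a suspect drawing path.
// The method and its preconditions are invented for the example.
static class LabelPainter
{
    public static void DrawCaption(Graphics g, Font font, string caption)
    {
        Debug.Assert(g != null, "Graphics handed to DrawCaption was null");
        Debug.Assert(font != null, "Font handed to DrawCaption was null");
        Debug.Assert(caption != null, "Caption handed to DrawCaption was null");

        g.DrawString(caption, font, Brushes.Black, PointF.Empty);
    }
}

Note that Debug.Assert compiles away in release builds - the point raised in the comments that follow - which is why teams often wrap these checks in their own assert helper that also logs or throws in release.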

Klaim
Yup, we did a lot of this to at least try to identify the root cause but it actually meant the race stopped occurring and thus the bug went away.
Jeff Yates
Looks a lot like memory trashing then. You should try reviewing all your arrays, replacing them with std::array, then searching for potential virtual inheritance problems, like when you forget to declare a virtual destructor somewhere.
Klaim
Don't asserts get compiled out of release builds anyway?
Neil Barnwell
Yes, and if you get the problem only in release, then you have a big first clue about what is going on. Anyway, it's recommended to replace the default assert() function with something more convenient and customized, so that it reports the relevant information on assert and throws in release when needed, as here.
Klaim
+1  A: 

In this situation, where nothing else works, I introduce additional logging.

I also add in email notifications that show me the state of the application when it breaks down.

Sometimes I add in performance counters... I put that data in a table and look at trends.

Even if nothing shows up, you are narrowing things down. One way or another, you will end up with useful theories.
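
A bare-bones version of the extra-logging step in .NET might look like this hedged sketch (DiagLog, the diag.log path, and the message format are all invented); the point is just to get timestamped, per-thread state onto disk so the bad sequence can be reconstructed afterwards:

using System;
using System.Diagnostics;
using System.Threading;

// Sketch: minimal extra tracing, flushed on every write so a crash does not
// lose the tail of the log. Path and format are placeholders.
static class DiagLog
{
    static DiagLog()
    {
        Trace.Listeners.Add(new TextWriterTraceListener("diag.log"));
        Trace.AutoFlush = true;
    }

    public static void Note(string message)
    {
        Trace.WriteLine(string.Format("{0} [thread {1}] {2}",
            DateTime.UtcNow.ToString("HH:mm:ss.fff"),
            Thread.CurrentThread.ManagedThreadId,
            message));
    }
}

Calls like DiagLog.Note("TabControl selection changed") can then be sprinkled around the suspect area - though, as the comments note, any extra I/O can itself shift the timing of a race.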

Brian MacKay
We did a lot of this but adding the logging changed the race condition such that the good side always won (i.e. no bug). As our particular issue seems to be a sequence of events, it was also only partially able to indicate what sequence was the bad sequence.
Jeff Yates
Well, that's progress. If it's a race condition, maybe you could strategically add in some delays (temporarily...) and observe the result?
Brian MacKay
+2  A: 

First you need to get stack traces from your clients, that way you can actually do some forensics.

Next, do fuzz tests with random input and keep these tests running for long stretches; they're great at finding those irrational border cases that human programmers and testers can't find through use cases and an understanding of the code.
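
A hedged sketch of the fuzzing idea in C# (FuzzHarness and ProcessInput are placeholders for the real entry point under test): feed random input in a long-running loop, and log the seed so a failing run can be replayed with the same sequence:

using System;

// Sketch: long-running fuzz loop. ProcessInput stands in for the code under
// test; the seed is printed so a failing run can be replayed deterministically.
class FuzzHarness
{
    static void Main()
    {
        int seed = Environment.TickCount;
        Console.WriteLine("Fuzz seed: " + seed);
        var rng = new Random(seed);

        for (long i = 0; ; i++)              // run for a long stretch
        {
            var buffer = new byte[rng.Next(1, 256)];
            rng.NextBytes(buffer);
            ProcessInput(buffer);            // placeholder for the real entry point
        }
    }

    static void ProcessInput(byte[] input)
    {
        // Stand-in: the real system under test would be driven here.
    }
}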

Robert Gould
We have the stack traces. The crash itself is clear but the root cause is not. Unfortunately, crashes don't always occur or start from where the problem lies, as is the case in our problem. Random input would be nice for evaluating the fix, but wouldn't help in confirming the root cause.
Jeff Yates
True, the real issue is like quantum physics, or Zen :) Anyway, my approach is to counter these with chaos theory.
Robert Gould
+1  A: 

Some questions you could ask yourself:

  • When did this piece of code last work without a problem?
  • What has been done to it since it stopped working?

If the code has never worked, the approach would naturally be different.

This is a very common scenario, at least where many people change a lot of code all the time.

Nailer
We know what code changes led to the bug appearing and this has helped in identifying potential fixes, but I suspect the fix we have is masking the real problem, which exists in older pre-existing code.
Jeff Yates
+1  A: 

These are horrible and almost always resistant to the 'fixes' the engineer thinks he is putting in, as they have a habit of coming back to bite months later. Be wary of any fixes made to intermittent bugs. Be prepared for a bit of grunt work and intensive logging as this sounds more of a testing problem than a development problem.

My own problem when overcoming bugs like these was that I was often too close to the problem, not standing back and looking at the bigger picture. Try and get someone else to look at how you approach the problem.

Specifically, my bug was to do with the setting of timeouts and various other magic numbers that in retrospect were borderline and so worked almost all of the time. The trick in my own case was to do a lot of experimentation with the settings so that I could find out which values would 'break' the software.

Do the failures happen during specific time periods? If so, where and when? Is it only certain people that seem to reproduce the bug? What set of inputs seem to invite the problem? What part of the application does it fail on? Does the bug seem more or less intermittent out in the field?

When I was a software tester, my main tools were a pen and paper to record notes of my previous actions - remembering a lot of seemingly insignificant details is vital. By observing and collecting little bits of data all the time, the bug will appear to become less intermittent.

AndyUK
+1  A: 

Specific scenario

While I don't want to concentrate on only the issue I am having, here are some details of the current issue we face and how I've tackled it so far.

The issue occurs when the user interacts with the user interface (a TabControl to be exact) at a particular phase of a process. It doesn't always occur and I believe this is because the window of time for the problem to be exhibited is small. My suspicion is that the initialization of a UserControl (we're in .NET, using C#) coincides with a state change event from another area of the application, which leads to a font being disposed. Meanwhile, another control (a Label) tries to draw its string with that font, and hence the crash.

However, actually confirming what leads to the font being disposed has proved difficult. The current fix has been to clone the font so that the drawing label still has a valid font, but this really masks the root problem which is the font being disposed in the first place. Obviously, I'd like to track down the full sequence, but that is proving very difficult and time is short.
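
For context, the clone-the-font workaround described above amounts to something like this hedged sketch (the FontFix class and the statusLabel/sharedFont names are invented; only the idea matches the actual fix): the label keeps its own copy of the font, so a later Dispose of the shared font elsewhere can no longer invalidate what the label is about to draw with.

using System.Drawing;
using System.Windows.Forms;

// Sketch of the "clone the font" workaround; names are invented.
static class FontFix
{
    public static void ApplySharedFont(Label statusLabel, Font sharedFont)
    {
        // Give the label its own copy; disposing sharedFont elsewhere
        // no longer pulls the rug out from under the label's paint call.
        statusLabel.Font = (Font)sharedFont.Clone();
    }
}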

Approach

My approach was first to look at the stack trace from our crash reports and examine the Microsoft code using Reflector. Unfortunately, this led to a GDI+ call with little documentation, which only returns a number for the error - .NET turns this into a pretty useless message indicating something is invalid. Great.

From there, I went to look at what call in our code leads to this problem. The stack starts with a message loop, not in our code, but I found a call to Update() in the general area under suspicion and, using instrumentation (traces, etc), we were able to confirm to about 75% certainty that this was the source of the paint message. However, it wasn't the source of the bug - asking the label to paint is no crime.

From there, I looked at each aspect of the paint call that was crashing (DrawString) to see what could be invalid and started to rule each one out until it fell on the disposable items. I then determined which ones we had control over and the font was the only one. So, I took a look at how we handled the font and under what circumstances we disposed it to identify any potential root causes. I was able to come up with a plausible sequence of events that fit the reports from users, and therefore able to code a low risk fix.

Of course, it crossed my mind that the bug was in the framework, but I like to assume we screwed up before passing the blame to Microsoft.

Conclusion

So, that's how I approached one particular example of this kind of problem. As you can see, it's less than ideal, but fits with what many have said.

Jeff Yates
+1  A: 

For a difficult-to-reproduce error, the first step is usually documentation. In the area of the code that is failing, modify the code to be hyper-explicit: One command per line; heavy, differentiated exception handling; verbose, even prolix debug output. That way, even if you can't reproduce or fix the error, you can gain far more information about the cause the next time the failure is seen.

The second step is usually assertion of assumptions and bounds checking. For everything you think you know about the code in question, write .Assert()s and checks. Specifically, check objects for nullity and (if your language is dynamic) existence.
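
A hedged C# sketch of that kind of explicit checking (the CopyRegion method and its bounds are invented for illustration), using guard clauses that stay active in release builds rather than debug-only asserts:

using System;

// Sketch: explicit assumption checks that survive into release builds.
// CopyRegion and its bounds are invented for the example.
static class Guarded
{
    public static byte[] CopyRegion(byte[] source, int offset, int length)
    {
        if (source == null)
            throw new ArgumentNullException("source");
        if (offset < 0 || length < 0 || offset + length > source.Length)
            throw new ArgumentOutOfRangeException("offset",
                string.Format("offset={0}, length={1}, source.Length={2}",
                              offset, length, source.Length));

        var result = new byte[length];
        Array.Copy(source, offset, result, 0, length);
        return result;
    }
}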

Third, check your unit test coverage. Do your unit tests actually cover every fork in execution? If you don't have unit tests, this is probably a good place to start.

The problem with unreproducible errors is that they're only unreproducible to the developer. If your end users insist on reproducing them, leveraging those crashes in the field is a valuable tool.

Jekke
+1  A: 

You say in a comment that you think it is a race condition. If you think you know what "feature" of the code is generating the condition, you can write a test to try to force it.

Here is some risky code in C:

const int NITER = 1000;
int thread_unsafe_count = 0;
int thread_unsafe_tracker = 0;

void* thread_unsafe_plus(void *a){
  int i, local;
  thread_unsafe_tracker++;
  for (i=0; i<NITER; i++){
    local = thread_unsafe_count;   /* non-atomic read...            */
    local++;
    thread_unsafe_count = local;   /* ...modify-write: updates race */
  }
  return NULL;
}
void* thread_unsafe_minus(void *a){
  int i, local;
  thread_unsafe_tracker--;
  for (i=0; i<NITER; i++){
    local = thread_unsafe_count;
    local--;
    thread_unsafe_count = local;
  }
  return NULL;
}

which I can test (in a pthreads environment) with:

#include <pthread.h>
#include <stdio.h>

int main(void){
  pthread_t th1, th2;
  pthread_create(&th1,NULL,&thread_unsafe_plus,NULL);
  pthread_create(&th2,NULL,&thread_unsafe_minus,NULL);
  pthread_join(th1,NULL);
  pthread_join(th2,NULL);
  if (thread_unsafe_count != 0) {
    printf("Ah ha!\n");    /* a lost update means the race was hit */
  }
  return 0;
}

In real life, you'll probably have to wrap your suspect code in some way to help the race hit more often.

If it works, adjust the number of threads and other parameters to make it hit most of the time, and now you have a chance.

dmckee
+1  A: 

I've run into bugs on systems that seem to consistently cause errors, but when stepping through the code in a debugger the problem mysteriously disappears. In all of these cases the issue was one of timing.

When the system was running normally there was some sort of conflict for resources or taking the next step before the last one finished. When I stepped through it in the debugger, things were moving slowly enough that the problem disappeared.

Once I figured out it was a timing issue it was easy to find a fix. I'm not sure if this is applicable in your situation, but whenever bugs disappear in the debugger timing issues are my first suspects.

Eric Ness
A: 

Unless there are major time constraints, I don't start testing changes until I can reliably reproduce the problem.

If you really had to, I suppose you could write a test case that appears to sometimes trigger the problem, and add it to your automated test suite (you do have an automated test suite, right?), and then make your change and hope that test case never fails again, knowing that if you didn't really fix anything at least you now have more chance of catching it. But by the time you can write a test case, you almost always have things reduced down to the point where you're no longer dealing with such an (apparently) non-deterministic situation.
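
If you do go that route, the test often ends up looking something like this hedged NUnit-style sketch (RunSuspectScenario is a placeholder for whatever sequence appears to trigger the bug): hammer the flaky scenario many times per run so the suite at least has a chance of catching a regression.

using NUnit.Framework;

// Sketch: a "sometimes catches it" regression test. RunSuspectScenario is a
// placeholder for the sequence believed to trigger the bug.
[TestFixture]
public class IntermittentBugTests
{
    [Test]
    public void SuspectScenario_SurvivesManyIterations()
    {
        for (int i = 0; i < 1000; i++)
        {
            // Any exception thrown inside fails the whole test run.
            RunSuspectScenario();
        }
    }

    private static void RunSuspectScenario()
    {
        // Placeholder: drive the state sequence believed to race here.
    }
}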

skiphoppy
A: 

Once you fully understand the bug (and that's a big "once"), you should be able to reproduce it at will. When the reproduction code (automated test) is written, you fix the bug.

How to get to the point where you understand the bug?

Instrument the code (log like crazy). Work with your QA - they are good at re-creating the problem, and you need to arrange to have a full dev toolkit available to you on their machines. Use automated tools for uninitialized memory/resources. Just plain stare at the code. No easy solution there.

Arkadiy
A: 

For Java-based applications, I'd recommend using ReplayDIRECTOR: http://replaysolutions.com/ (I work for them!)

It's very useful for reproducing bugs that are hard to reproduce: it records all the interactions of your Java app with its surrounding environment (user input, system calls, DB responses) as your application is being used, and allows later replay of the recorded session, with the application actually running and executing the same path through the code. The recorded inputs will be fed to the application exactly as during the recording. In essence, it's a time machine for Java applications.