The question says it all. If you have a bug that multiple users report, but there is no record of the bug occurring in the log, nor can the bug be repeated, no matter how hard you try, how do you fix it? Can you fix it at all?

I am sure this has happened to many of you out there. What did you do in this situation, and what was the final outcome?


Edit: I am more interested in what was done about an unfindable bug, not an unresolvable bug. Unresolvable bugs are such that you at least know that there is a problem and have a starting point, in most cases, for searching for it. In the case of an unfindable one, what do you do? Can you even do anything at all?

A: 

Discuss the problem, read code, often quite a lot of it. Often we do it in pairs, because you can usually eliminate the possibilities analytically quite quickly.

krosenvold
+3  A: 

It can be difficult, and sometimes nearly impossible. But in my experience, you will sooner or later be able to reproduce and fix the bug if you spend enough time on it (whether that time is worth spending is another matter).

General suggestions that might help in this situation:

  • Add more logging, if possible, so that you have more data the next time the bug appears.
  • Ask the users whether they can replicate the bug. If they can, have them replicate it while you watch over their shoulder, and hopefully find out what triggers it.
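
A minimal sketch of the first suggestion, written in Python only for illustration. The function `apply_discount` and the logger name `orders` are hypothetical stand-ins for whatever code sits in the suspect area; the idea is to record the inputs on every call and to log loudly when a state you believe is impossible shows up.

```python
import logging

logger = logging.getLogger("orders")  # hypothetical logger name


def apply_discount(price, percent):
    """Hypothetical function in the area where the bug is suspected."""
    # Record the inputs before doing any work, so a later report can be
    # matched against what the code actually received.
    logger.debug("apply_discount called: price=%r percent=%r", price, percent)
    result = price * (100 - percent) / 100
    if result < 0:
        # A state we believe should never occur: log it instead of ignoring it.
        logger.error("negative result %r for price=%r percent=%r",
                     result, price, percent)
    return result
```

The debug line costs little when the logger is set to a higher level, so it can stay in place until the bug shows up again.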
driis
Yeah--not 10 minutes ago I got a call, it looks like someone has finally managed to reproduce a bug that I have only suspected but not been sure of for many years. It's been something that showed up maybe twice a year, never could be reproduced and the observed behavior could always have been due to silently-loaded price changes in the past.
Loren Pechtel
+2  A: 

Ask my last boss. He seemed to be convinced I was supposed to be able to fix irreproducible bugs.

Dinah
Is this an answer?
Akash Kava
It's not bad, to be honest!
Brian Agnew
@Dinah. Give us his email address and we'll ask him :-)
Stephen C
Once had a boss who didn't want bugs in his software. To be fair it was the president of the company (so he didn't know much about programming) but still. :)
Silence
AFAIK almost nobody wants bugs in his software...
Jeremy Friesner
+2  A: 

Modify the code where you think the problem is happening so that extra debug info is recorded somewhere. The next time it happens, you will have what you need to solve the problem.

KM
+4  A: 

If it's a GUI app, it's invaluable to watch the customer generate the error (or try to). They'll no doubt be doing something you'd never have guessed they were doing (not wrongly, just differently).

Otherwise, concentrate your logging in that area. Log almost everything (you can pull it out later), and get your app to dump its environment as well, e.g. machine type, VM type, encoding used.

Does your app report a version number, a build number etc.? You need this to determine precisely which version you're debugging (or not!).
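
As a rough illustration of dumping the environment together with a version and build number, here is a sketch in Python. `APP_VERSION` and `BUILD_NUMBER` are hypothetical values that a real build would normally inject; the rest uses only the standard library.

```python
import json
import locale
import platform
import sys

APP_VERSION = "1.4.2"      # hypothetical; normally injected by the build
BUILD_NUMBER = "unknown"   # hypothetical


def environment_snapshot():
    """Collect facts about the runtime that are cheap to log at startup."""
    return {
        "app_version": APP_VERSION,
        "build_number": BUILD_NUMBER,
        "machine": platform.machine(),
        "os": platform.platform(),
        "runtime": platform.python_implementation() + " " + platform.python_version(),
        "default_encoding": sys.getdefaultencoding(),
        "preferred_encoding": locale.getpreferredencoding(False),
    }


if __name__ == "__main__":
    # Write this once at startup, or attach it to every bug report.
    print(json.dumps(environment_snapshot(), indent=2))
```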

If you can instrument your app (e.g. by using JMX if you're in the Java world) then instrument the area in question. Store stats e.g. requests+parameters, time made etc. Make use of buffers to store the last 'n' requests/responses/object versions/whatever, and dump them out when the user reports an issue.
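
The "last 'n' requests" buffer could look something like the following sketch. It is in Python rather than Java/JMX, and the class name `RecentRequests` and its methods are made up for this example; a bounded `deque` does the work of discarding old entries.

```python
from collections import deque
from datetime import datetime, timezone


class RecentRequests:
    """Keeps only the last `capacity` requests in memory."""

    def __init__(self, capacity=100):
        # A deque with maxlen silently drops the oldest entry when full.
        self._entries = deque(maxlen=capacity)

    def record(self, name, params):
        self._entries.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "request": name,
            # Shallow copy, so later changes to the caller's dict do not alter history.
            "params": dict(params),
        })

    def dump(self):
        """Return the buffered entries, oldest first, e.g. to attach to a bug report."""
        return list(self._entries)
```

Because the buffer is bounded, it can stay enabled in production; `dump()` is called only when a user reports an issue.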

Brian Agnew
A: 

There are tools like gotomeeting.com that let you share a screen with your user and observe the behaviour. There could be many potential problems, such as the amount of other software installed on their machines, or some tool or utility conflicting with your program. I believe gotomeeting is not the only solution, though; there could also be timeout issues or slow internet issues.

Most of the time, I would say, software does not report correct error messages. In Java and C#, for example, track every exception: don't catch everything everywhere, but keep one point where you can catch and log. UI bugs are difficult to solve unless you use remote desktop tools. And much of the time the bug could even be in third-party software.
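
The "one point where you can catch and log" idea, sketched in Python instead of Java or C#. This uses the standard `sys.excepthook` mechanism; the logger name `app` is an assumption for the example.

```python
import logging
import sys

logger = logging.getLogger("app")  # hypothetical logger name


def log_uncaught(exc_type, exc_value, exc_traceback):
    """Log any exception that nothing else handled, with its full traceback."""
    if issubclass(exc_type, KeyboardInterrupt):
        # Let Ctrl-C behave normally.
        sys.__excepthook__(exc_type, exc_value, exc_traceback)
        return
    logger.critical("uncaught exception",
                    exc_info=(exc_type, exc_value, exc_traceback))


# Install the hook once, at startup.
sys.excepthook = log_uncaught
```

Code elsewhere can then let unexpected exceptions propagate instead of swallowing them, and they still end up in the log.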

Akash Kava
+1  A: 

There are two types of bugs you can't replicate. The kind you discovered, and the kind someone else discovered.

If you discovered the bug, you should be able to replicate it. If you can't replicate it, then you simply haven't considered all of the contributing factors leading towards the bug. This is why whenever you have a bug, you should document it. Save the log, get a screenshot, etc. If you don't, then how can you even prove the bug really exists? Maybe it's just a false memory?

If someone else discovered a bug, and you can't replicate it, obviously ask them to replicate it. If they can't replicate it, then you try to replicate it. If you can't replicate it quickly, ignore it.

I know that sounds bad, but I think it is justified. The amount of time it will take you to replicate a bug that someone else discovered is very large. If the bug is real, it will happen again naturally. Someone, maybe even you, will stumble across it again. If it is difficult to replicate, then it is also rare, and probably won't cause too much damage if it happens a few more times.

You can be a lot more productive if you spend your time actually working, fixing other bugs and writing new code, than you will be trying to replicate a mystery bug that you can't even guarantee actually exists. Just wait for it to appear again naturally, then you will be able to spend all your time fixing it, rather than wasting your time trying to reveal it.

Apreche
Depends on the severity of the consequences. A bug that silently shifts money values 1 decimal place for a bank that happens about once a year will definitely cause a lot of problems regardless of how rare it is.
Davy8
+17  A: 

Language

Different programming languages will have their own flavour of bugs.

In C, adding debug statements can make the problem impossible to duplicate because the debug statement itself shifts pointers (far enough to avoid a SEGFAULT). Pointer issues are a nightmare to track and replicate, but there are debuggers (such as GDB and DDD) that can help.

In Java, an application that has multiple threads might only show its bugs with a very specific timing or sequence of events.

Environment

Is it a server-based web application? Is it a desktop application? Is it browser-based? Does it only happen in production?

Depending on the complexity of the environment in which the application (that has the bug) is running, the only recourse might be to simplify the environment.

Exit extraneous applications and kill background tasks.

Variables and Consistency

Eliminate as many unknowns as possible. Isolate architectural components. Simplify by removing non-essential, or possibly problematic (conflicting), elements. Deactivate different application modules.

Remove all differences between production, test, and development. Use the same hardware. Follow the exact same steps, perfectly, to set up the computers. Consistency is key.

Logging

Logging is an invaluable tool. With it you can correlate the time events happened. You can examine logs for any obvious errors.

Hardware

If all the software pieces appear fine, consider that hardware is a possible source of the issue. Are the network connections solid? Do the hard drives have bad blocks? Are the CPU fans whirring away? Does the motherboard have enough power for all components (CPU, network card, video card, drives)?

What happens when you run the application locally (i.e., not across the network)? Are other servers experiencing the same issues? Is the database remote? Can you use a local database?

Time and Statistics

When does the problem happen? How frequently? What other systems are running at that time? Gather hard numerical data on the problem. A problem that might, at first, appear random, might actually have a pattern.
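
One simple way to gather such numbers is to bucket the report times, for example by hour of day. The sketch below is in Python, and the timestamps are invented purely for illustration.

```python
from collections import Counter
from datetime import datetime


def reports_by_hour(timestamps):
    """Count reports per hour of day from ISO-8601 timestamp strings."""
    hours = (datetime.fromisoformat(ts).hour for ts in timestamps)
    return Counter(hours)


# Made-up report times, for illustration only.
reports = [
    "2024-03-01T02:05:00",
    "2024-03-08T02:40:00",
    "2024-03-15T14:10:00",
    "2024-03-22T02:15:00",
]
counts = reports_by_hour(reports)
print(counts.most_common(1))  # prints [(2, 3)]: three of four reports fall in the 02:00 hour
```

A cluster like this would suggest looking at whatever else runs at that hour, such as backups or scheduled jobs.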

Change Management

When did the problem first start? What changed in the environment (hardware and software)? What happens when you roll back to a previous version? What are the differences between the version that has problems and the version that does not?
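
If there is a version known to be good and a later one known to be bad, the versions in between can be searched by halving. This Python sketch assumes a hypothetical `is_bad` check (for an elusive bug, that may mean running a version for a while and watching for reports) and assumes that once the bug appears it stays in every later version.

```python
def first_bad_version(versions, is_bad):
    """Binary-search an ordered list of versions for the first bad one.

    Assumes versions[0] is known good, versions[-1] is known bad, and that
    the bug is present in every version after the one that introduced it.
    """
    low, high = 0, len(versions) - 1   # low: known good, high: known bad
    while high - low > 1:
        mid = (low + high) // 2
        if is_bad(versions[mid]):
            high = mid
        else:
            low = mid
    return versions[high]


# Hypothetical usage: the bug was introduced in "1.3".
versions = ["1.0", "1.1", "1.2", "1.3", "1.4"]
print(first_bad_version(versions, lambda v: v in {"1.3", "1.4"}))  # prints 1.3
```

The differences between the returned version and the one before it are then the place to start reading.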

Library (Mis)management

Windows has DLL Hell: conflicting versions of DLLs littered throughout the system.

Unix has libraries sprinkled throughout directories, shot with broken symbolic links.

Perform a fresh install of the operating system, and include only the supporting software required for your application.

Java library files can be equally evil. Make sure every library is only being used once. Sometimes the application container will have a different version of a library than the application itself. This might not be possible to replicate in your development environment. Use a library management tool such as Maven or Ivy.

Sleep

It is worth reiterating what others have mentioned: sleep on it. Spend time away from the problem, finish other tasks (like documentation). Be physically distant from computers and get some exercise.

Bug Characteristics and Testing

Code a detection method that triggers a notification (e.g., log, e-mail, pop-up, pager beep) when the bug happens. Use automated testing to submit data into the application. Use random data and use data that covers known and possible edge cases. Eventually the bug should reappear.
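
A small sketch of the random-data approach, in Python. The function `average` is a hypothetical piece of code with a hidden edge case, and `fuzz` and `random_list` are names invented for this example. The important detail is that the seed is kept with the failure, so a failing run can be replayed exactly.

```python
import random


def average(values):
    """Hypothetical function under test; fails on an empty list."""
    return sum(values) / len(values)


def random_list(rng):
    """Generate a list of 0 to 5 small integers, so the empty edge case can occur."""
    return [rng.randint(-10, 10) for _ in range(rng.randint(0, 5))]


def fuzz(function, make_input, runs=1000, seed=None):
    """Call `function` with random inputs and return the first failure found.

    The seed is returned with the failure so the same input sequence can be
    replayed. Returns None if no input raised an exception.
    """
    if seed is None:
        seed = random.randrange(2**32)
    rng = random.Random(seed)
    for run in range(runs):
        data = make_input(rng)
        try:
            function(data)
        except Exception as exc:
            # This is the "detection" step; a real system might notify someone here.
            return {"seed": seed, "run": run, "input": data, "error": repr(exc)}
    return None


result = fuzz(average, random_list, runs=1000, seed=1)
print(result)
```

Calling `fuzz` again with the reported seed generates the same inputs in the same order, which turns a random failure into a repeatable one.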

Dave Jarvis
+2  A: 

Assuming you have already added all the logging that you think would help and it didn't... two things spring to mind:

  1. Work backwards from the reported symptom. Think to yourself: "If I wanted to produce the symptom that was reported, what bit of code would I need to be executing, and how would I get to it, and how would I get to that?" D leads to C leads to B leads to A. Accept that if a bug is not reproducible, then normal methods won't help. I've had to stare at code for many hours with this kind of thought process going on to find some bugs. Usually it turns out to be something really stupid.

  2. Remember Bob's first law of debugging: if you can't find something, it's because you're looking in the wrong place :-)

Bob Moore
+4  A: 

Sometimes I just have to sit and study the code until I find the bug. Try to prove that the bug is impossible, and in the process you may figure out where you might be mistaken. If you actually succeed in convincing yourself it's impossible, assume you messed up somewhere.

It may help to add a bunch of error checking and assertions to confirm or deny your beliefs/assumptions. Something may fail that you'd never expect to.
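
As an illustration of writing beliefs down as assertions, here is a Python sketch around a hypothetical `allocate_seats` function. Each assertion states one assumption, with a message that says which one broke.

```python
def allocate_seats(requested, available):
    """Hypothetical function whose result is suspected to go wrong."""
    # Beliefs about the inputs, stated explicitly.
    assert requested >= 0, f"requested is negative: {requested}"
    assert available >= 0, f"available is negative: {available}"

    granted = min(requested, available)
    remaining = available - granted

    # Beliefs about the result. If one of these ever fires, a state we
    # thought was impossible has been reached.
    assert 0 <= granted <= requested, f"granted out of range: {granted}"
    assert remaining >= 0, f"remaining is negative: {remaining}"
    return granted, remaining
```

Note that Python removes `assert` statements when run with `-O`, so checks that must survive in production would need explicit `if` tests with logging instead.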

David
+1  A: 

Start by looking at what tools you have available to you. For example, crashes on a Windows platform go to WinQual, so if this is your case you now have crash dump information. Do you have static analysis tools that spot potential bugs, runtime analysis tools, profiling tools?

Then look at the input and output. Anything similar about the inputs in situations when users report the error, or anything out of place in the output? Compile a list of reports and look for patterns.
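
One mechanical way to look for patterns in a compiled list of reports is to find what every report has in common. The Python sketch below assumes each report has been reduced to a flat dictionary of simple values; the field names and values are invented for illustration.

```python
def common_factors(reports):
    """Return the key/value pairs shared by every report.

    Values must be hashable (strings, numbers, and so on).
    """
    if not reports:
        return {}
    shared = set(reports[0].items())
    for report in reports[1:]:
        shared &= set(report.items())
    return dict(shared)


# Made-up reports, for illustration only.
reports = [
    {"os": "Windows", "locale": "tr_TR", "version": "2.1"},
    {"os": "Linux", "locale": "tr_TR", "version": "2.1"},
    {"os": "Windows", "locale": "tr_TR", "version": "2.0"},
]
print(common_factors(reports))  # prints {'locale': 'tr_TR'}
```

A shared factor is only a hint, not proof, but it narrows down which setup to try when attempting to reproduce the problem.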

Finally, as David stated, stare at the code.

Stephen Nutt
A: 

Ask the user to give you remote access to their computer so you can see everything yourself. Or ask the user to make a short video of how they reproduce the bug and send it to you.

Of course, neither is always possible, but when they are, they may clarify some things. The common ways of finding bugs are still the same: separating the parts that may cause the bug, trying to understand what's happening, and narrowing down the code that could cause the bug.

Yaroslav Yakovlev
+2  A: 

If you can't replicate it, you may fix it, but can't know that you've fixed it.

I've made my best guess at how the bug was triggered (even if I didn't know how that situation could come about), fixed that, and made sure that if the bug surfaced again, our notification mechanisms would let a future developer know the things that I wish I had known. In practice, this meant adding log events when the paths which could trigger the bug were crossed, and recording metrics for related resources. And, of course, making sure that the tests exercised the code well in general.

Deciding what notifications to add is a feasibility and triage question. So is deciding on how much developer time to spend on the bug in the first place. It can't be answered without knowing how important the bug is.

I've had good outcomes (didn't show up again, and the code was better for it), and bad (spent too much time not fixing the problem, whether the bug ended up fixed or not). That's what estimates and issue priorities are for.

Karl Anderson
I did that once. I managed to determine where the bug was right down to the line of code, fixed it, only to have it not be fixed because there was a second instance of the bug 500 lines away.
Joshua
A: 

Make random changes until something works :-)

Michael Wiles