ansaurus

Question

"Works on my machine" - How to fix non-reproducible bugs?

Answer 1

A:

unit testing

Szczepan 2009-07-09 09:15:56

**Buzzword alarm!** If you had a certain idea in mind on how unit testing could solve cross-machine problems or at least produce hints, please specify it. Until then: -1

Boldewyn 2009-07-09 09:24:38

I agree with Boldewyn. Unit Testing will not help.

Holli 2009-07-09 09:50:02

Answer 2

A:

It's a very complicated issue. I had been thinking of writing up a procedure for it, and I have just put one together for non-reproducible bugs. It might be helpful.

When the bug occurs, several factors may have caused it.

I am sure all bugs are reproducible. I always keep an eye out for these kinds of issues.

Get the system information.
Find out what other operations the customer performed before the bug.
Find out how often it occurs: is it rare or frequent?
Find out what happened next after the issue (is it always the same or different?).
Find the factors behind the bug (as a developer).
Find the exact position where the issue happened.
Find all the system factors at that time.
Check for memory leaks, user error, or a wrong condition in the code.
List all the factors that may cause this issue.
Work out how each factor affects it and what data each factor holds.
Check whether memory problems occurred.
Check that the customer has the same current version of the code as you.
Check all logs from at least the past month for any abnormal operations, and keep notes.

joe 2009-07-09 09:19:02

Answer 3

+9 A:

The easiest way is always to see the customer in action (assuming that it's readily reproducible by the customer). Oftentimes, problems arise due to issues with the customer's computer environment, conflicts with other programs, etc. These are details which you will not be able to catch on your dev rig. So a site visit might be useful; but if that's not convenient, tools like RealVNC might help as well in letting you see the customer 'do their thing'.

(watching the customer in action also allows you to catch them out in any WTF moments that they might have)

Now, if the problem is intermittent, then things get somewhat more complicated. The best way to get around this problem would be to log useful information in places where you guess problems could occur and perhaps use a tool like Splunk to index the log files during analysis. A diagnostic build (i.e. with extra logging) might be useful in this case.

jpoh 2009-07-09 09:19:34

We found Webex useful for doing online demonstrations to clients and also for helping clients out with problems.

David Liddle 2009-07-09 12:04:08

Answer 4

+10 A:

Extensive logging usually helps.

Kirill V. Lyadvinsky 2009-07-09 09:20:16

stacktrace and other information (running processes)

Umair Ahmed 2009-07-09 09:25:02

Also ask for a detailed recipe for reproducing the error on the client's machine, and for a list of recent software installs.

Umair Ahmed 2009-07-09 09:28:09

Answer 5

+5 A:

I'm just in the middle of implementing an automated error reporting system that sends back to me information (currently via email although you could use a webservice) from any exception encountered by the app.

That way I get (nearly) all the information that I would have if I were sitting in front of VS2008, and it really helps me to work out what the problem is.

The customers are also usually (sorta) impressed that I know about their problem as soon as they encounter it!

Also, if you use the Application.ThreadException error handler you can send back info on unexpected exceptions too!
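For readers outside .NET, the same idea can be sketched in Python, where `sys.excepthook` plays roughly the role that Application.ThreadException plays above. This is only an illustration under assumptions: `build_error_report`, `send_report`, `report_unhandled` and `install_reporting` are hypothetical names, and `send_report` just writes to stderr where a real app would send an email or call a web service.

```python
import platform
import sys
import traceback


def build_error_report(exc_type, exc_value, exc_tb):
    """Collect the traceback plus basic system details into one text report."""
    lines = [
        "Python: " + sys.version.split()[0],
        "Platform: " + platform.platform(),
        "Error: " + exc_type.__name__ + ": " + str(exc_value),
        "Traceback:",
    ]
    lines.extend(traceback.format_exception(exc_type, exc_value, exc_tb))
    return "\n".join(lines)


def send_report(report):
    # Hypothetical transport: replace with email or a web service call.
    sys.stderr.write(report + "\n")


def report_unhandled(exc_type, exc_value, exc_tb):
    """Hook for exceptions that nothing else caught."""
    send_report(build_error_report(exc_type, exc_value, exc_tb))


def install_reporting():
    """Call once at startup so uncaught exceptions produce a report."""
    sys.excepthook = report_unhandled
```

As the comments below point out, a report like this only covers failures that surface as exceptions.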

Calanus 2009-07-09 09:21:13

Good point - I'm already using MadExcept for all our apps, with emailed bug reports straight back to base. However, in my latest case, the app just quits without warning!

Roddy 2009-07-09 09:58:38

Unfortunately this only tells you when an error is happening; it does not tell you why, or what steps were taken to make it happen.

Aleris 2009-07-09 10:41:09

+1, though of course for some bugs the behaviour is not an exception, but just "I expected A and instead the wrong thing B happened".

Daniel Daranas 2009-07-09 10:46:40

Answer 6

+3 A:

We use all the methods you mention progressively starting with the easiest and proceeding to the harder.

However, you forget that sometimes hardware is at fault. For example, memory could be malfunctioning, and some computation-intensive code will behave strangely, throwing exceptions with weird diagnostics. Of course, it works on your machine, since you don't have faulty hardware.

Experience is needed to identify such errors and to insist that the customer try installing the program on another machine or run a hardware check. One thing that helps greatly is good error handling: when your code throws an exception it should provide details, not just indicate that something is bad. With good error indication it's easier to identify such suspicious issues related to faulty hardware.

sharptooth 2009-07-09 09:21:37

Answer 7

A:

One technique I've found useful is building an application with an integrated "diagnostic" mode (enabled by a command line switch when you launch the app). That certainly avoids the need to create custom builds with additional logging.
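A minimal sketch of such a switch in Python, assuming a hypothetical `--diagnostic` flag and a logger named `myapp` (neither comes from a real product):

```python
import argparse
import logging


def configure_logging(argv):
    """Enable verbose logging only when --diagnostic is passed."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--diagnostic", action="store_true",
                        help="write detailed debug output to the log")
    # parse_known_args ignores the app's other arguments instead of failing.
    args, _unknown = parser.parse_known_args(argv)
    level = logging.DEBUG if args.diagnostic else logging.WARNING
    logger = logging.getLogger("myapp")  # hypothetical logger name
    logger.setLevel(level)
    return logger
```

The normal build and the diagnostic build are then the same binary; the customer only has to add one flag and send back the log.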

Otherwise, it sounds like what you're doing is as good an approach as any.

butterchicken 2009-07-09 09:21:50

Answer 8

+1 A:

Copilot (assuming customer is somewhere cold and rainy :)

Daniel Daranas 2009-07-09 09:21:53

Answer 9

A:

I don't have this problem very often, but if I did, I would use a screen sharing or recording application to watch the user in action without having to go there (unless, as you said, it's warm and sunny and the company pays for the trip).

J. Pablo Fernández 2009-07-09 09:22:20

Answer 10

A:

I have recently been investigating such an issue myself. Over the course of my career I have learnt that, while computer systems may be complex, they are predictable, so have faith that you can find the problem. My approach to these kinds of issues is twofold:

1) Gather as much detailed information as possible from the customer about their failure and analyse it meticulously for patterns. Gather multiple sets of data for multiple failure occurrences to build up a clearer picture.

2) Try to reproduce the failure in house. Continue to make your system more and more similar to the customer's system until you can reproduce it, the system is identical, or it becomes impractical to make it more similar.

While doing this consider:

1) What differences exist between this system and other working systems?

2) What has recently changed in your product or the customer's configuration that has caused the problem to start occurring?
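The first question can be partly automated. As a rough sketch, assuming you can run a small script on both a working and a failing machine, something like the following captures a few facts and reports only the differences. The chosen keys and the function names `snapshot` and `diff_snapshots` are illustrative, not a complete inventory of what can differ.

```python
import locale
import os
import platform
import sys


def snapshot():
    """Capture a few environment facts that commonly differ between machines."""
    return {
        "os": platform.platform(),
        "python": sys.version.split()[0],
        "encoding": locale.getpreferredencoding(False),
        "path": os.environ.get("PATH", ""),
    }


def diff_snapshots(working, failing):
    """Return {key: (working_value, failing_value)} for every differing key."""
    keys = sorted(set(working) | set(failing))
    return {
        key: (working.get(key), failing.get(key))
        for key in keys
        if working.get(key) != failing.get(key)
    }
```

Saving the snapshot from each failure occurrence also gives you the multiple data sets mentioned in step 1.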

Regards

Howard May 2009-07-09 09:22:24

Answer 11

+1 A:

The usual procedure for this is to expect something like this will happen and add a ton of logging information. Of course you don't enable it from the beginning, but only when this happens.

Usually customers don't like to have to install a new version or some diagnostic tools. It is not their job to do your debugging. And visiting a client for cases like these is rarely an option. You must involve the client as little as possible. Changing a switch and sending you a log file is OK - anything more than this is too much.

I like the alternative of thinking the problem over in the bath. I would start by trying to find out the differences between my machine and the client's configuration.

kgiannakakis 2009-07-09 09:22:44

Answer 12

A:

Depending on the issue you could get WinDbg dumps; they normally give a pretty good idea of what is going on. We have diagnosed quite a few problems that weren't crashes from minidumps.

For .NET apps we also use Trace.WriteLine; then we can get the user to fire up DbgView and send us the output.

L2Type 2009-07-09 09:23:30

Answer 13

+15 A:

One of the attributes of good debuggers, I think, is that they always have a lot of weapons in their toolkit. They never seem to get "stuck" for too long and there is always something else for them to try. Some of the things I've been known to do:

ask for memory dumps
install a remote debugger on a client machine
add tracing code to builds
add logging code for debugging purposes
add performance counters
add configuration parameters to various bits of suspicious code so I can turn on and off features
rewrite and refactor suspicious code
try to replicate the issue locally on a different OS or machine
use debugging tools such as application verifier
use 3rd party load generation tools
write simulation tools in-house for load generation when the above failed
use tools like Glowcode to analyse memory leaks and performance issues
reinstall the client machine from scratch
get registry dumps and apply them locally
use registry and file watcher tools
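As an illustration of the "configuration parameters to turn features on and off" item in the list above, here is a small sketch that reads a switch from the environment. The `MYAPP_FEATURE_` prefix and the function name `feature_enabled` are made up for the example; a config file or registry key would work just as well.

```python
import os


def feature_enabled(name, environ=None, default=True):
    """Read a (hypothetical) MYAPP_FEATURE_<NAME> switch from the environment."""
    environ = os.environ if environ is None else environ
    raw = environ.get("MYAPP_FEATURE_" + name.upper())
    if raw is None:
        # Not configured: keep the feature's normal behaviour.
        return default
    return raw.strip().lower() in ("1", "true", "yes", "on")
```

Wrapping a suspicious code path in `if feature_enabled("cache"):` lets the customer switch it off without a new build, which narrows down where the bug lives.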

Eventually, I find the bug just gives up out of some kind of awe at my persistence. Or the client realises that it's probably a machine or client side install or configuration issue.

1800 INFORMATION 2009-07-09 09:25:10

+1 for awe, and also the fact that it's a pretty good list.

gridzbi 2009-07-09 10:00:01

Also stuff like VNC and remote desktop can help when you want to be able to actually see what the program does and how the user gets there.

Earlz 2009-12-29 15:21:00

Answer 14

A:

Just a short anecdote (hence 'community wiki'): Last week I thought it was a clever idea in a Django app to import the module pprint for pretty printing Python data only if DEBUG was True:

if settings.DEBUG:
    from pprint import pprint

Then I used pprint here and there as a debugging statement:

pprint(somevar) # show somevar on the console

After finishing the work, I tested the app with DEBUG=False. You can guess what happened: the site broke with HTTP 500 errors all over the place, and I did not know why, because there is no traceback if DEBUG is False. I was puzzled that the errors disappeared magically when I switched back to debug mode.

It took me 1-2 hours of putting print statements all over the code to find that the code crashes at exactly the above pprint() line. Then it took me another half an hour to convince myself to stop banging my head on the table.

Now comes the moral of the story:

Not everything that looks like a clever idea at first sight turns out to be so clever in the end.
When debugging errors like these, look at every configuration option and platform switch your own code acts on. This can be quite a lot more than just some user preferences. Document it well whenever you make an assumption about the user's platform (e.g., if you test for Win/Mac/Linux only, will your code crash on BSD or Solaris?)
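One way to avoid the trap in the anecdote, sketched under the assumption that the debug flag is passed in explicitly (in the Django app it would be settings.DEBUG): import pprint unconditionally and make only the call conditional, so the name exists in both modes. The helper name `debug_pprint` is invented for this example.

```python
from pprint import pprint  # imported unconditionally, so the name always exists


def debug_pprint(value, debug):
    """Pretty-print value when debug is true; otherwise do nothing."""
    if debug:
        pprint(value)
```

With this, a forgotten `debug_pprint(somevar, settings.DEBUG)` is a harmless no-op in production instead of a NameError.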

Cheers,

Boldewyn 2009-07-09 09:40:52

Answer 15

A:

However tough a non-reproducible problem is, we can still take a structured and strategic approach to solving it, and I can say from experience that it requires out-of-the-box thinking in 50% of the cases. Generally speaking, one can categorize the problems into different types, which helps to identify which tool to use. For example, if you have a non-reproducible application crash or a memory issue, you can use profilers to narrow the issue down to a particular piece of functionality.

Also, one of the most important approaches is information-rich logging. I also use a lot of enums to describe the state of the process, depending on the scenario in question. For example, I used values like Initiated, Triggered, Running, Waiting Repaired etc. to describe the schedule's states and saved them to the DB at different stages.
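A sketch of that state-logging idea, using Python's enum and an SQLite table. The table name `state_log`, the function names, and the schedule identifiers are assumptions made for the example, and treating "Waiting" and "Repaired" as two separate states is also a guess.

```python
import enum
import sqlite3


class ScheduleState(enum.Enum):
    INITIATED = "Initiated"
    TRIGGERED = "Triggered"
    RUNNING = "Running"
    WAITING = "Waiting"    # assumed to be separate from REPAIRED
    REPAIRED = "Repaired"


def open_state_log(path=":memory:"):
    """Create the (hypothetical) state_log table if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS state_log ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "schedule_id TEXT NOT NULL, "
        "state TEXT NOT NULL, "
        "logged_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def record_state(conn, schedule_id, state):
    """Append one state transition; rows are never updated in place."""
    conn.execute(
        "INSERT INTO state_log (schedule_id, state) VALUES (?, ?)",
        (schedule_id, state.value),
    )
    conn.commit()


def history(conn, schedule_id):
    """Return the recorded states for one schedule, oldest first."""
    rows = conn.execute(
        "SELECT state FROM state_log WHERE schedule_id = ? ORDER BY id",
        (schedule_id,),
    )
    return [ScheduleState(row[0]) for row in rows]
```

Because every transition is appended rather than overwritten, the table shows the path a schedule took before it failed, not just its final state.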

MSIL 2009-07-09 09:45:31

Answer 16

+2 A:

I think one of the most important things is the ability to ask sensible questions around what the customer has reported... More often than not they're not mentioning something that they don't see as relevant, but is actually key.

Telepathy would also be useful...

Paddy 2009-07-09 09:47:49

Answer 17

+1 A:

As a software engineer doing webstuff (booking/shop/member systems etc) the most important thing for us is to get as much information from the customer as possible.

Going from

it's broke!

to

it's broke! & here are screenshots of every option I picked whilst generating this particular report

reduces the amount of time it takes us to reproduce and fix an issue no end.

It may be obvious, but it takes a fair amount of chasing to get this kind of information from our customers sometimes! But it's worth it just for those moments you find they're not actually doing what they say they are.

Zeus 2009-07-09 09:56:05

Answer 18

+1 A:

I had these problems also. My solution was to add lots of logging and give the customer a debug build with all the possible debug information. Then make sure Dr. Watson (it was on Windows NT) created a memory dump with enough information. After loading the memory dump in the debugger I could find out where and why it crashed.

EDIT: Oh, this obviously only works if the application terminates violently...

rve 2009-07-09 09:56:10

Answer 19

+2 A:

We've had good success using EurekaLog with it posting directly to FogBugz. This gets us a bug report containing a call stack, along with related system info (other processes running, memory, network details etc) and a screen shot. Occasionally customers enter further info too, which is helpful. It's certainly, in most cases, made it much easier and quicker to fix bugs.

Pauk 2009-07-09 10:08:56

Very similar - we use MadExcept posting straight into FogBugz.

Roddy 2009-07-09 10:23:17

We originally started with MadExcept, but switched to EurekaLog since the output was better (may have changed now since that was a couple of years back).

Pauk 2009-07-09 10:25:27

Answer 20

A:

Not mentioned yet, but "directed code review" is one good solution, especially if you didn't do a proper review (at least 1 hour per 100 lines of code) before release.

I have also seen impressive demos of AppSight Suite, which is basically an advanced environment monitoring and logging tool. It allows the customer to record what happens on his machine in an extensive but fairly compact log file which you can then replay.

MSalters 2009-07-09 10:26:09

Answer 21

A:

As many have mentioned: extensive logging, and asking the client for the log files when something goes wrong. In addition, as I have worked more with web apps, I also provide detailed but succinct deployment documentation (e.g., deployment steps, environmental resources that need to be set up etc).

Here are common problems I've seen that lead to the types of problem you are describing:

Environment not set up properly (e.g., missing environment variables, data sources etc).
Application not fully deployed (e.g., database schema not deployed).
Difference in operating system configuration (default character encoding being the most common culprit for me).

Most of the time, these issues can be identified through the log content.
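To make such problems show up in the log in the first place, the application can check its environment at startup. A minimal sketch, assuming the required variable names and the expected encoding are supplied by you; `check_environment` and the `DB_URL` variable in the usage note are hypothetical:

```python
import locale
import os


def _normalize(encoding):
    """Compare encodings loosely, so 'UTF-8' and 'utf8' count as equal."""
    return encoding.lower().replace("-", "").replace("_", "")


def check_environment(required_vars, expected_encoding, environ=None):
    """Return a list of human-readable problems; an empty list means OK."""
    environ = os.environ if environ is None else environ
    problems = []
    for name in required_vars:
        if not environ.get(name):
            problems.append("missing environment variable: " + name)
    actual = locale.getpreferredencoding(False)
    if _normalize(actual) != _normalize(expected_encoding):
        problems.append(
            "default encoding is " + actual + ", expected " + expected_encoding
        )
    return problems
```

Logging the result of something like `check_environment(["DB_URL"], "UTF-8")` on every start means the first log file the client sends already answers the most common questions.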

Jack Leow 2009-07-09 12:00:19

Answer 22

+1 A:

I think following the trail of actions the user took can lead us to the reasons for a failure or for selective failures. But most of the time users are at a loss to describe precisely their interactions with the application, so automatic screenshot taking helps (if it is a desktop app; for a .NET app you can check Jeff's UnhandledExceptionHandler). Logging all the important actions that change the state of objects can also help us understand it.
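A sketch of such an action trail in Python, assuming the app calls `record` at each significant user action and attaches `dump()` to any error report. The class name `ActionTrail` and the limit of 50 entries are arbitrary choices for the example.

```python
import collections
import time


class ActionTrail:
    """Keep the most recent user actions in memory for error reports."""

    def __init__(self, max_actions=50):
        # A deque with maxlen silently drops the oldest entry when full.
        self._actions = collections.deque(maxlen=max_actions)

    def record(self, description):
        """Remember one action together with the time it happened."""
        self._actions.append((time.time(), description))

    def dump(self):
        """Return the recorded descriptions, oldest first."""
        return [description for _timestamp, description in self._actions]
```

For example, after recording "open report", "select filter" and "click export" on a trail created with `max_actions=2`, `dump()` returns only the last two actions.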

2009-07-09 12:01:03

Answer 23

A:

You can use tools like Microsoft SharedView or TeamViewer to connect to the remote PC and inspect the problem directly on site. Of course, you'll need the customer's cooperation.

Alexander 2009-12-29 14:07:59
