I have a project where I am required to fix a program that tends to crash non-deterministically. The software performs a lot of calculations and database calls and can run under very high load, meaning many clients.

It is a very critical component; without it nothing works. It needs to perform well and be able to run without user interaction for long periods.

It is actually a native C++/ATL project with COM for communication between its two executables.

I have already spent a lot of time studying the code and looking for obvious flaws, such as shared variables that are accessed without locking, and exception handlers that do nothing with an exception besides 'return false', even when it could be a critical exception.

But I wanted to know if anyone has tips for tackling a project like this, where many people have already attempted to fix the issue and failed, and now you have taken up the challenge and don't want to fail.

I am prepared to go far to fix this, but I need some guidance on how to go about it in a good way.

My idea is to first set up a test environment and collect as much information as possible about the crashes that do occur, and then use logging, stack traces, etc. to find the points where it crashes. This may or may not be a good way to debug such a project.

Any input is appreciated.
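One way to collect that kind of information could look like the sketch below: keep a small in-memory trail of recent events and dump it when the process is about to terminate because of an unhandled C++ exception. Everything here (`Breadcrumbs`, `breadcrumbs()`, `dumpAndAbort`, `installCrashTrail`, the trail size of 4) is hypothetical and only standard C++; it is an assumption that the crashes pass through `std::terminate` at all.

```cpp
#include <array>
#include <cstddef>
#include <cstdlib>
#include <exception>
#include <iostream>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical fixed-size trail of the most recent events, kept in memory.
class Breadcrumbs {
public:
    void add(const std::string& event) {
        std::lock_guard<std::mutex> lock(mutex_);
        slots_[next_ % slots_.size()] = event;  // overwrite the oldest slot
        ++next_;
    }

    // Returns the retained events, oldest first.
    std::vector<std::string> snapshot() const {
        std::lock_guard<std::mutex> lock(mutex_);
        std::vector<std::string> out;
        const std::size_t count =
            next_ < slots_.size() ? next_ : slots_.size();
        for (std::size_t i = next_ - count; i < next_; ++i) {
            out.push_back(slots_[i % slots_.size()]);
        }
        return out;
    }

private:
    mutable std::mutex mutex_;
    std::array<std::string, 4> slots_;  // tiny on purpose, for illustration
    std::size_t next_ = 0;
};

// Process-wide trail used by the terminate handler below.
Breadcrumbs& breadcrumbs() {
    static Breadcrumbs instance;
    return instance;
}

// Runs when an exception escapes every handler: dump the trail, then abort.
// Best effort only; it cannot help if the trail's own mutex is still held.
void dumpAndAbort() {
    std::cerr << "terminating; recent events:\n";
    for (const std::string& event : breadcrumbs().snapshot()) {
        std::cerr << "  " << event << '\n';
    }
    std::abort();
}

// Call once at startup.
void installCrashTrail() { std::set_terminate(dumpAndAbort); }
```

After `installCrashTrail()`, calls like `breadcrumbs().add("begin DB call")` at interesting points give some context for the final moments. This only covers unhandled C++ exceptions; access violations and similar faults need the platform's own mechanism, which on Windows would probably mean an unhandled-exception filter that writes a minidump.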

+2  A: 

It may be obvious, but my roadmap for such a bug-fixing task is:

  1. Collect as much information as possible on the source of the crash (from users, developers, etc.).
  2. Inspect documentation and dependencies.
  3. Inspect source code.
  4. Build an isolated test environment and try to reproduce the crash.

If you still can't find the source of the bug, try to clean up the source code and add a more verbose logging system.
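As a rough idea of what "more verbose logging" could mean in a multi-threaded service, here is a minimal sketch of a thread-safe logger that stamps each line with elapsed time, level, thread id and component. The `Logger` class and `Level` enum are hypothetical, not an existing library, and a real project would probably also want wall-clock timestamps and file rotation.

```cpp
#include <chrono>
#include <iostream>
#include <mutex>
#include <sstream>
#include <string>
#include <thread>

enum class Level { Debug, Info, Error };

// Hypothetical minimal logger: one line per call, safe to use from many threads.
class Logger {
public:
    explicit Logger(std::ostream& out, Level minLevel = Level::Debug)
        : out_(out), minLevel_(minLevel) {}

    void log(Level level, const std::string& component,
             const std::string& message) {
        if (level < minLevel_) return;  // below the configured verbosity
        using namespace std::chrono;
        const auto ms =
            duration_cast<milliseconds>(steady_clock::now() - start_).count();
        // Build the whole line first so the lock is held only for the write.
        std::ostringstream line;
        line << ms << "ms [" << name(level) << "] [thread "
             << std::this_thread::get_id() << "] "
             << component << ": " << message << '\n';
        std::lock_guard<std::mutex> lock(mutex_);
        out_ << line.str();
        out_.flush();  // flush so the last lines are not lost in a crash
    }

private:
    static const char* name(Level level) {
        switch (level) {
            case Level::Debug: return "DEBUG";
            case Level::Info:  return "INFO";
            case Level::Error: return "ERROR";
        }
        return "?";
    }

    std::ostream& out_;
    Level minLevel_;
    std::mutex mutex_;
    std::chrono::steady_clock::time_point start_ =
        std::chrono::steady_clock::now();
};
```

With `Logger logger(std::cerr);`, a call like `logger.log(Level::Error, "db", "query failed");` writes one complete line. Flushing on every call is slow under high load, but for hunting a crash it matters more that the final lines actually reach the output.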

Regards

OMG_peanuts
A: 

Log, log, log, log.

Mau