We have a legacy third-party telephony system built on something called "CT ADE" that periodically hangs for a few seconds (5 to 30) then resumes. During these hangs, users experience frustrating pauses in the phone menu. This has been going on for several weeks at least.

This code was not written by me, so my knowledge of it is very limited. Internally there are multiple "tasks" (threads?), one per phone line, that handle calls. When the application hangs, all "tasks" are hung.

This issue does not seem to be load related; it occurs even during times of low usage. It also does not appear to be network related (it occurs on systems where the DB is located on the same physical box as this app) or disk related, although creating sample tasks that do lots of DB I/O and file I/O can cause shorter pauses within this application.

The process does not show any memory or CPU spikes when the problem occurs.

At this point I'm just grasping for anything to try...

A: 

Working with legacy code is painful - in my experience you just need to dive in and try and understand what the code is doing through whatever means works for you - be it by reading the code and trying to figure out what it does, or debugging various scenarios and stepping through each line of code executed.

It will take a while, and there will be parts of the code you will never understand, but given enough time staring at the code and experimenting with what it does you should eventually be able to understand enough to figure out what the problem is.

There is a book, Working Effectively with Legacy Code, which I have never read but which is meant to be very good.

Kragen
By legacy I meant an aging application. The custom code built on top of that application is actively being maintained (not by me, FYI). I'm merely trying to think of possible causes to help resolve the issue.
Matthew Timbs
A: 

Try running a sampling profiler during one of these hangs to see where CPU time is being spent.
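To illustrate what a sampling profiler does, here is a minimal sketch in Python (not CT ADE, and not a real profiler). It polls every thread's current frame at a fixed interval and counts which function is on top. The names `sample_stacks` and `blocked_task` are made up for this example, and it relies on `sys._current_frames()`, which is CPython-specific.

```python
import collections
import sys
import threading
import time


def sample_stacks(duration=0.5, interval=0.01):
    """Poll all other threads and count the function at the top of each stack."""
    counts = collections.Counter()
    sampler_id = threading.get_ident()
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        for thread_id, frame in sys._current_frames().items():
            if thread_id != sampler_id:  # skip the sampling thread itself
                counts[frame.f_code.co_name] += 1
        time.sleep(interval)
    return counts


def blocked_task(lock):
    """Hypothetical stand-in for a task waiting on a shared resource."""
    lock.acquire()
    lock.release()


if __name__ == "__main__":
    lock = threading.Lock()
    lock.acquire()  # hold the lock so the worker has to wait
    worker = threading.Thread(target=blocked_task, args=(lock,))
    worker.start()
    print(sample_stacks(duration=0.2))  # samples land in blocked_task
    lock.release()
    worker.join()
```

Note that the waiting thread shows up in the samples even though it uses almost no CPU, because the sampler looks at where each thread is, not at how much work it is doing.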

Paul R
CPU load does not increase.
Matthew Timbs
@Matthew - even so, a sampling profiler may still tell you that what little CPU time is being spent is being spent in some kind of deadlock that eventually times out.
Paul R
A: 

If the problem is not related to high CPU usage, a profiler probably will not gain you anything.

To me it sounds like a multi-threading issue. If possible, attach a debugger and pause when the problem shows up. Look at the currently executing code and the call stacks of all threads. It might be that multiple threads try to access a single resource or thread-safe function and have to wait because another thread has exclusive access to this resource. This might be something inconspicuous, like trying to write to a log.
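As a rough illustration of that kind of contention, here is a small Python sketch (again, not CT ADE). Everything in it is hypothetical: `log_lock` stands in for a shared log, `task` for a per-line task, and the `time.sleep(0.5)` for one slow write. While one caller holds the lock, every task pauses, yet CPU usage stays near idle.

```python
import threading
import time

log_lock = threading.Lock()  # hypothetical shared resource, e.g. a log file


def write_log(message, slow=False):
    """Serialize all log writes behind one lock; `slow` simulates a stalled write."""
    with log_lock:
        if slow:
            time.sleep(0.5)  # stands in for a blocked disk or DB call


def task(line_id, stop, gaps):
    """One task per phone line: log a heartbeat and record the longest pause seen."""
    longest = 0.0
    last = time.monotonic()
    while not stop.is_set():
        write_log(f"line {line_id} heartbeat")
        now = time.monotonic()
        longest = max(longest, now - last)
        last = now
        time.sleep(0.01)
    gaps[line_id] = longest


def run_demo(num_lines=4):
    """Run the tasks, trigger one slow log write, and return each task's longest pause."""
    stop = threading.Event()
    gaps = {}
    threads = [
        threading.Thread(target=task, args=(i, stop, gaps)) for i in range(num_lines)
    ]
    for t in threads:
        t.start()
    time.sleep(0.1)                      # let the tasks run normally for a bit
    write_log("slow write", slow=True)   # one caller holds the lock for 0.5 s
    time.sleep(0.1)
    stop.set()
    for t in threads:
        t.join()
    return gaps


if __name__ == "__main__":
    for line_id, gap in sorted(run_demo().items()):
        print(f"line {line_id}: longest pause {gap:.2f}s")
```

Every task reports a pause of roughly the same length as the slow write, which matches the symptom of all tasks hanging at once and then resuming on their own.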

rotti2
That was my thought too, but we see it on some servers and not on others. We can't force it to happen, and it clears up on its own, so attaching a debugger will be very difficult. It's a good thought though. If we could figure out a way to reproduce it, then the debugger would be a good next step. Thanks.
Matthew Timbs