Hi all.

I've hit a dead end with one of the clients using my software. Out of about 40 copies of our product sold (an application written in VB.NET 2005 on .NET 2.0), about 2 become non-responsive, with one core of the dual-core CPU stuck at 100% (the program uses only one core).

The most logical guess is an infinite loop causing this behavior, but there are thousands of lines of code with many, many loops. That is all the information I've got; now, how do you suggest I approach debugging this problem?

EDIT: Basically, the software is responsible for calculating the amount of credit spent using other devices, such as PCs, etc. It is a cybercafe management program and fails intermittently, i.e. it is subtracting credit when it fails. It does other things in the background too, like checking whether it is time to create a database backup.

EDIT: Solved. It was the most unlikely problem. The Access Database Engine, which I use as the DBMS, is actually the part of my application that is problematic. It has difficulty working with a row (JUST ONE FRIGGIN ROW) in one of the tables. I can't delete it, or add a record related to that row in any other table; even MS Access 2007 drives the CPU up to 100% when I try to work with that row!

A simple "Compact and Repair" command fixed everything. I guess I'll issue that command every time my application starts up. That would prevent this from happening again.

Thanks to WinDbg I could find where the problem was. I recommend everyone to learn how to use it 'cause it's a real time saver.

+2  A: 

Well you'll need to work out where it's tight-looping. What is your client doing with the software at the time? What does the software do in the first place?

You might want to consider adding a lot of logging to your code and giving the client a copy with all that logging enabled, helping you to trace where they're having problems.

Jon Skeet
+4  A: 

If possible, get a process dump and look at the stack trace.
I have never done it myself, but it should work with VS/WinDbg and SOS (Son of Strike). Here is a blog post about it.

+2  A: 

Use a logger like log4net, which you can introduce into your existing codebase with PostSharp. Log all method entries and exits, so that you can find the faulty method. Then you can refine your logging if that is still required.

It looks like this works for VB.NET too, although I have no experience there. Maybe this article helps you a bit.
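To illustrate the entry/exit idea independently of log4net and PostSharp, here is a minimal sketch in Python (not the tools named above, and not VB.NET). The `traced` decorator and the `subtract_credit` function are hypothetical names invented for this example; the assumption is simply that every interesting method gets wrapped once.

```python
import functools
import logging

logger = logging.getLogger("trace")

def traced(func):
    """Log every entry to and exit from func (hypothetical helper)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.debug("ENTER %s", func.__name__)
        try:
            return func(*args, **kwargs)
        finally:
            # Runs on a normal return and when an exception propagates.
            logger.debug("EXIT  %s", func.__name__)
    return wrapper

@traced
def subtract_credit(balance, amount):
    # Hypothetical business method standing in for the real code.
    return balance - amount
```

With such a log, a method that shows an ENTER line with no matching EXIT before the process hangs is the first place to look for the tight loop.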

tanascius
Aha. Tracing everything with PostSharp could truly do the trick and narrow the findings down to a particular function.
Arnis L.
+1  A: 

There may be a problem with single-core and multi-core CPUs behaving differently, for example when a background thread tries to update the UI.

(And I admit that I wrote an app back in the dark ages that did not cleanly separate background and UI threads and caused problems when multi-core CPUs became mainstream. The solution was to call SetProcessAffinityMask to restrict the app to a single core.)

If that's the case, you should check whether the 100% CPU occurs only with a particular kind of CPU, and whether using SetProcessAffinityMask solves the problem. If it does, you know where to look in your code.

devio
You can set CPU affinity on a running process using the free Process Explorer utility. It should be enough to see whether devio's answer helps.
Thomas Bratt
I know, Windows TaskMgr can do it too.
TheAgent
+4  A: 

If it is an infinite loop, then try attaching a debugger and hitting break. WinDbg is ideal for this.

The technique also works for the case when the loop is just iterating too many times but eventually carries on with the rest of the code. It is possible to spend a couple of minutes repeating the procedure to get a good sample.

This technique has saved me several times and works well for hung applications too :)
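The "break repeatedly and look at the stack" idea can be sketched outside a debugger as well. The following is a rough Python illustration, not WinDbg: it polls one thread's stack a number of times and counts which function is on top. `sample_thread` and `runaway_loop` are hypothetical names, and `sys._current_frames()` is a CPython-specific call, so treat this as a sketch under those assumptions.

```python
import collections
import sys
import threading
import time

def sample_thread(thread_id, samples=20, interval=0.01):
    """Poll one thread's stack and count which function is on top."""
    hits = collections.Counter()
    for _ in range(samples):
        frame = sys._current_frames().get(thread_id)
        if frame is not None:
            hits[frame.f_code.co_name] += 1
        time.sleep(interval)
    return hits

stop = []  # becomes non-empty when the worker should exit

def runaway_loop():
    # Hypothetical stand-in for the buggy code: spins until told to stop.
    while not stop:
        pass

worker = threading.Thread(target=runaway_loop, daemon=True)
worker.start()
hits = sample_thread(worker.ident)
stop.append(True)
worker.join()
print(hits.most_common(3))
```

The function that dominates the samples is where the thread spends its time, which is the same conclusion you reach by hitting break a few times in the debugger.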

Thomas Bratt
+1 Excellent suggestion, Thomas. It seems like a lot of developers today do not use applications like WinDbg. SOS will be required for providing information about the CLR, as well. Link provided for the OP's reference: http://msdn.microsoft.com/en-us/library/bb190764.aspx
joseph.ferris
Thanks :) Good point about SOS. The magic line for .NET is: '.loadby sos mscorwks'
Thomas Bratt
WinDbg is not a managed code system. Most devs who've never even seen C++ code have no idea it exists. I do, but I don't use it every day, which makes it harder again when I do have to use it. You have to write some really interesting code to need WinDbg in C#.
Spence
Not that interesting though ;) I find it excellent for tracking .NET memory leaks (using gcroot etc) which just can't be done in the same way with MS Visual Studio.
Thomas Bratt
Such tools have to become a part of VS, with some improved UI. I have to google the crap out of the net to understand it can't debug .NET applications on its own.
TheAgent
+7  A: 

Install WinDbg (the Windows debugger) on the target machine. Start the debugger, attach it to the suspicious process, let the program run, and wait until the problem happens. When it does, enter the following command on the debugger command line:

!runaway

This shows which of your threads are consuming the most CPU time. Then capture several call stacks from the thread that is consuming most of your CPU.

Here is an example:

0:015> !runaway

 User Mode Time
  Thread       Time
   0:1074      0 days 0:00:21.637
  11:137c      0 days 0:00:02.792
   4:12c8      0 days 0:00:00.530
   9:1374      0 days 0:00:00.046
  15:13d0      0 days 0:00:00.000
  14:1204      0 days 0:00:00.000
  13:154c      0 days 0:00:00.000
  12:144c      0 days 0:00:00.000
  10:1378      0 days 0:00:00.000
   8:1340      0 days 0:00:00.000
   7:12f0      0 days 0:00:00.000
   6:12d4      0 days 0:00:00.000
   5:12d0      0 days 0:00:00.000
   3:12c4      0 days 0:00:00.000
   2:12c0      0 days 0:00:00.000
   1:12b4      0 days 0:00:00.000

Now assume we want a call stack for the second thread in the list, thread 11, so we first switch to thread 11. This can be done by entering ~11s.

0:015> ~11s

eax=03fbb270 ebx=ffffffff ecx=00000002 edx=00000060 esi=00000000 edi=00000000
eip=77475e74 esp=0572f60c ebp=0572f67c iopl=0         nv up ei pl zr na pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000246
ntdll!KiFastSystemCallRet:
77475e74 c3              ret

Now get a call stack for this thread by executing kp:

0:011> kp
ChildEBP RetAddr  
0572f608 77475620 ntdll!KiFastSystemCallRet
0572f60c 75b09884 ntdll!NtWaitForSingleObject+0xc
0572f67c 75b097f2 kernel32!WaitForSingleObjectEx+0xbe
*** ERROR: Symbol file could not be found.  Defaulted to export symbols for C:\Program Files\Mozilla Firefox 3.1 Beta 1\nspr4.dll - 
0572f690 10019a0b kernel32!WaitForSingleObject+0x12
WARNING: Stack unwind information not available. Following frames may be wrong.
0572f6ac 10015979 nspr4!PR_MD_WAIT_CV+0x8b
0572f6c4 10015763 nspr4!PR_GetPrimordialCPU+0x79
*** ERROR: Symbol file could not be found.  Defaulted to export symbols for C:\Program Files\Mozilla Firefox 3.1 Beta 1\xul.dll - 
0572f6e0 64d44d6a nspr4!PR_Wait+0x33
0572f708 64dbe67e xul!NS_CycleCollectorForget2_P+0x698a
0572f72c 10019b3f xul!gfxWindowsPlatform::FontEnumProc+0xfd4e
0572f734 10015d32 nspr4!PR_MD_UNLOCK+0x1f
0572f738 1001624b nspr4!PR_Unlock+0x22
0572f754 1001838d nspr4!PRP_TryLock+0x4cb
00000000 00000000 nspr4!PR_Now+0x109d

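The kp command also prints the parameters of each function on the stack when full symbol information is available (none appear above because the symbols are missing). Local variables can be printed with dv.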

Alternatively, you can use Process Explorer from Sysinternals.

If all this is not possible, because it is a remote client machine, install userdump, which creates a dump file that can be sent to you for further analysis. You can create a batch file for the customer to invoke userdump with the correct parameters. Userdump is a tool from Microsoft, which can be downloaded from their web page.

steve
I see numbers as identifiers for threads. How can I know which thread it is that is causing problems? Is there a way to know which method that particular thread is executing?
TheAgent
This was quite helpful. Now the only thing I need is to know what parameters have been sent to all the methods in the stack. How do I get the parameter values?
TheAgent
A: 

Could it be a threading problem? "fails intermittently" makes me think of it. Does the program receive signals/messages from the outside, like remoting/DCOM/sockets? Is progress information related to such messages presented in the user interface?

I once detected a threading problem because I always use a lot of ASSERTs. There was a sanity check ASSERT for the beginning of a message received through XML-RPC to be:

"<?xml " 

and the ASSERT caught an overwrite of the memory of the message. On inspection this turned out to be due to a missing lock in a critical section. This detection was only possible because the problem was caught so early by the ASSERT (and it happened sufficiently often to be detected).

This is not very specific or directed advice, but my suggestion is to add ASSERTs in places that may be affected by a threading problem.

Note that a firing ASSERT does not necessarily have to abort the program or show a message box. ASSERTs can be redirected to a log file instead, including the full stack trace at the time the ASSERT fired.
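As a rough illustration of an assertion that logs instead of aborting, here is a minimal Python sketch (not the original C++/.NET ASSERT macro). `soft_assert` and `handle_message` are hypothetical names made up for this example; the only part taken from the story above is the "<?xml " prefix check.

```python
import logging
import traceback

log = logging.getLogger("asserts")

def soft_assert(condition, message):
    """Record a failed check with a stack trace instead of aborting."""
    if not condition:
        # Drop the last frame so the trace ends at the caller, not here.
        stack = "".join(traceback.format_stack()[:-1])
        log.error("ASSERT FAILED: %s\n%s", message, stack)
    return bool(condition)

def handle_message(payload):
    # Sanity check from the story above: a message should start with "<?xml ".
    soft_assert(payload.startswith("<?xml "), "message does not start with '<?xml '")
    return len(payload)
```

Pointing the `asserts` logger at a file handler is enough to collect these reports from a client machine without interrupting the user.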

Peter Mortensen