I'm looking for a way to debug a rare Delphi 7 critical section (TCriticalSection) hang/deadlock. If a thread waits on a critical section for more than, say, 10 seconds, I'd like to produce a report with the stack traces of both the thread currently holding the critical section and the thread that failed to acquire it within those 10 seconds. It is OK if an exception is then raised or the application terminates.

I would prefer to continue using critical sections, rather than using other synchronization primitives, if possible, but can switch if necessary (such as to get a timeout feature).

If the tool/method works at runtime outside of the IDE, that is a bonus, since this is hard to reproduce on demand. In the rare case I can duplicate the deadlock inside the IDE, if I try to Pause to start debugging, the IDE just sits there doing nothing, and never gets to a state where I can view threads or call stacks. I can Reset the running program, though.

Update: In this case, I'm only dealing with one critical section and 2 threads, so this likely isn't a lock ordering problem. I believe there is an improper nested attempt to enter the lock across two different threads, which results in deadlock.

A: 

If you want to be able to wait on something with a timeout, you could try replacing your critical section with a TEvent signal. You can wait on the event with a timeout length and check the result code. If the signal was set, you can continue. If the wait timed out instead, you raise an exception.

At least, that's how I'd do it in D2010. I'm not sure if Delphi 7 has TEvent, but it probably does.
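
A minimal sketch of that idea, assuming the `SyncObjs` unit that ships with Delphi 7. The names `TTimedLock` and `ELockTimeout` are made up for illustration and are not part of the VCL. An auto-reset event that starts out signaled lets exactly one waiter through at a time, which is roughly what a lock does.

```pascal
unit TimedLock;

interface

uses
  SysUtils, SyncObjs;

type
  // Hypothetical exception type, raised when the wait times out.
  ELockTimeout = class(Exception);

  // Hypothetical helper class; an assumption for illustration only.
  TTimedLock = class
  private
    FEvent: TEvent;
  public
    constructor Create;
    destructor Destroy; override;
    procedure Acquire(TimeoutMs: LongWord);
    procedure Release;
  end;

implementation

constructor TTimedLock.Create;
begin
  inherited Create;
  // Auto-reset (ManualReset = False) and initially signaled:
  // the first waiter gets through and the event resets itself.
  FEvent := TEvent.Create(nil, False, True, '');
end;

destructor TTimedLock.Destroy;
begin
  FEvent.Free;
  inherited Destroy;
end;

procedure TTimedLock.Acquire(TimeoutMs: LongWord);
begin
  // Anything other than wrSignaled (timeout, error) is treated as failure.
  if FEvent.WaitFor(TimeoutMs) <> wrSignaled then
    raise ELockTimeout.Create('Timed out waiting for lock');
end;

procedure TTimedLock.Release;
begin
  // Signal the event again so the next waiter can proceed.
  FEvent.SetEvent;
end;

end.
```

Note that, unlike a critical section, a lock built this way is not reentrant: a thread that calls `Acquire` a second time without releasing will block on itself until the timeout expires.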

Mason Wheeler
+6  A: 

You should create and use your own lock object class. It can be implemented using critical sections or mutexes, depending on whether you want to debug this or not.

Creating your own class has an added benefit: You can implement a locking hierarchy and raise an exception when it is violated. Deadlocks happen when locks are not taken in exactly the same order, every time. Assigning a lock level to each lock makes it possible to check that the locks are taken in the correct order. You could store the current lock level in a threadvar and allow only locks with a lower lock level to be taken; otherwise you raise an exception. This will catch all violations, even when no deadlock happens, so it should speed up your debugging a lot.
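
The lock-level check could be sketched roughly as follows. `TLeveledLock` and `ELockOrderViolation` are hypothetical names, not existing classes. The sketch assumes levels are positive integers (a threadvar starts at zero, which is used here to mean "no lock held"), that locks are released in reverse order of acquisition, and, for simplicity, it treats re-entering the same lock as a violation.

```pascal
unit LeveledLock;

interface

uses
  SysUtils, SyncObjs;

type
  // Hypothetical exception type for ordering violations.
  ELockOrderViolation = class(Exception);

  // Hypothetical wrapper around TCriticalSection with a lock level.
  TLeveledLock = class
  private
    FLock: TCriticalSection;
    FLevel: Integer;
    FSavedLevel: Integer; // level the owning thread held before Acquire
  public
    constructor Create(ALevel: Integer);
    destructor Destroy; override;
    procedure Acquire;
    procedure Release;
  end;

implementation

threadvar
  // Level of the innermost lock held by this thread; 0 means none.
  CurrentLockLevel: Integer;

constructor TLeveledLock.Create(ALevel: Integer);
begin
  inherited Create;
  Assert(ALevel > 0, 'Lock levels must be positive');
  FLevel := ALevel;
  FLock := TCriticalSection.Create;
end;

destructor TLeveledLock.Destroy;
begin
  FLock.Free;
  inherited Destroy;
end;

procedure TLeveledLock.Acquire;
var
  Held: Integer;
begin
  Held := CurrentLockLevel;
  // Only a strictly lower level may be taken while another lock is held.
  if (Held <> 0) and (FLevel >= Held) then
    raise ELockOrderViolation.CreateFmt(
      'Lock level %d requested while holding level %d', [FLevel, Held]);
  FLock.Acquire;
  // Safe to write: FSavedLevel is only touched while the lock is owned.
  FSavedLevel := Held;
  CurrentLockLevel := FLevel;
end;

procedure TLeveledLock.Release;
begin
  // Restore the level that was current before this lock was taken.
  CurrentLockLevel := FSavedLevel;
  FLock.Release;
end;

end.
```

With two locks created at levels 20 and 10, a thread that takes 20 and then 10 passes the check, while a thread that takes 10 and then 20 gets an exception immediately, whether or not a deadlock would actually have occurred on that run.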

As for getting the stack trace of the threads, there are many questions here on Stack Overflow dealing with this.

Update

You write:

In this case, I'm only dealing with one critical section and 2 threads, so this likely isn't a lock ordering problem. I believe there is an improper nested attempt to enter the lock across two different threads, which results in deadlock.

That can't be the whole story. There's no way to deadlock with two threads and a single critical section alone on Windows, because a thread can acquire a critical section it already owns recursively. There has to be another blocking mechanism involved, such as a SendMessage() call.

But if you really are dealing with two threads only, then one of them has to be the main / VCL / GUI thread. In that case you should be able to use the MadExcept "Main thread freeze checking" feature. It will try to send a message to the main thread, and fail after a customizable time has elapsed without the message being handled. If your main thread is blocking on the critical section, and the other thread is blocking on a message handling call then MadExcept should be able to catch this and give you a stack trace for both threads.

mghie
+1 for the MadExcept thread frozen check.
Mason Wheeler
madExcept can also be asked to take a thread dump any time, so is perhaps ideal for this.
mj2008
madExcept looks like the best option. Thanks!
Anagoge
+2  A: 

This is not a direct answer to your question, but something I ran into recently that had me (and a couple of colleagues) stumped for a while.

It was an intermittent thread hang, involving a critical section and once we knew the cause, it was very obvious and gave all of us a "d'oh" moment. However, it did take some serious hunting to find (adding more and more trace logging to pinpoint the offending statement) and that is why I thought I'd mention it.

The hang was also on entering a critical section. Another thread had indeed acquired that critical section. A deadlock as such did not seem to be the cause, as there was only one critical section involved, so there could be no problems with acquiring locks in a different order. The thread holding the critical section should simply have continued and then released the lock, allowing the other thread to acquire it.

In the end it turned out that the thread holding the lock was ultimately accessing the ItemIndex of a (IIRC) combobox, fairly innocuous it would seem. Unfortunately, getting that ItemIndex is reliant on message processing. And the thread waiting for the lock was the main application thread... (just in case anybody wonders: the main thread does all the message processing...)

We might have thought of this a lot earlier if it had been a little more obvious from the start that the VCL was involved. However, it started in non-UI related code, and VCL involvement only became apparent after adding instrumentation (enter/exit tracing) along the call tree and back through all triggered events and their handlers up to the UI code.

Just hope this story will be of help to somebody faced with a mysterious hang.

Marjan Venema
+1  A: 

Use a mutex instead of a critical section. There is little difference between mutexes and critical sections: critical sections are more efficient, while mutexes are more flexible. You can easily switch between mutexes and critical sections, using for example mutexes in the debug version.

For a critical section we use:

var
  FLock: TRTLCriticalSection;

  InitializeCriticalSection(FLock);  // create lock
  DeleteCriticalSection(FLock);      // free lock
  EnterCriticalSection(FLock);       // acquire lock
  LeaveCriticalSection(FLock);       // release lock

The same with a mutex:

var FLock: THandle;

  FLock:= CreateMutex(nil, False, nil);  // create lock
  CloseHandle(FLock);                    // free lock
  WaitForSingleObject(FLock, Timeout);   // acquire lock
  ReleaseMutex(FLock);                   // release lock

You can use timeouts (in milliseconds; 10000 for 10 seconds) with mutexes by implementing an acquire-lock function like this:

// Returns True if the mutex was acquired before the timeout (in ms) expired.
function AcquireLock(Lock: THandle; Timeout: LongWord): Boolean;
begin
  Result := WaitForSingleObject(Lock, Timeout) = WAIT_OBJECT_0;
end;
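
A possible way to put the pieces together, as a small console program. This is only a sketch: the 10 second timeout and the exception message are assumptions taken from the question, and the place where the exception is raised is where a stack trace report would be produced by whatever tool is in use.

```pascal
program MutexTimeoutDemo;

{$APPTYPE CONSOLE}

uses
  Windows, SysUtils;

// Returns True if the mutex was acquired before the timeout (in ms) expired.
function AcquireLock(Lock: THandle; Timeout: LongWord): Boolean;
begin
  Result := WaitForSingleObject(Lock, Timeout) = WAIT_OBJECT_0;
end;

var
  FLock: THandle;
begin
  FLock := CreateMutex(nil, False, nil);  // create lock
  if FLock = 0 then
    RaiseLastOSError;
  try
    // Give up after 10 seconds instead of hanging forever.
    if not AcquireLock(FLock, 10000) then
      raise Exception.Create('Could not acquire lock within 10 seconds');
    try
      // Protected work goes here.
      Writeln('Lock acquired');
    finally
      ReleaseMutex(FLock);                // release lock
    end;
  finally
    CloseHandle(FLock);                   // free lock
  end;
end.
```

Like a critical section, a Windows mutex can be acquired again by the thread that already owns it, so switching to a mutex keeps the reentrant behaviour; each successful wait still needs a matching ReleaseMutex call.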
Serg