We have a Win32 application in which one thread can stop another to inspect its state (PC, etc.) by doing SuspendThread/GetThreadContext/ResumeThread.

if (SuspendThread((HANDLE)hComputeThread[threadId])<0)  // freeze thread
   ThreadOperationFault("SuspendThread","InterruptGranule");
CONTEXT Context, *pContext;
Context.ContextFlags = (CONTEXT_INTEGER | CONTEXT_CONTROL);
if (!GetThreadContext((HANDLE)hComputeThread[threadId],&Context))
   ThreadOperationFault("GetThreadContext","InterruptGranule");

Extremely rarely, on a multicore system, GetThreadContext returns error code 5 (Windows system error code "Access Denied").

The SuspendThread documentation seems to clearly indicate that the targeted thread is suspended, if no error is returned. We are checking the return status of SuspendThread and ResumeThread; they aren't complaining, ever.
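A side note on the check in the snippet above: SuspendThread returns a DWORD, which is unsigned, and its documented failure value is (DWORD)-1. A `< 0` test on an unsigned value can never be true, so that test would stay silent even if the call failed. Whether that has anything to do with the rare error 5 is not established here; it only means the "never complains" observation is weaker than it looks. Below is a minimal portable sketch of the difference. The `DWORD` alias and the function names are stand-ins invented for this illustration so that it compiles without `<windows.h>`.

```cpp
#include <cstdint>

// Stand-in for the Win32 DWORD type (unsigned 32-bit); an assumption made so
// this sketch compiles without <windows.h>.
using DWORD = std::uint32_t;

// SuspendThread documents (DWORD)-1 as its failure return value.
constexpr DWORD kSuspendFailed = static_cast<DWORD>(-1);

// Mirrors the `< 0` test from the snippet: the 0 is compared as unsigned,
// so this returns false for every possible input, including the failure value.
bool failed_by_sign_test(DWORD rc) {
    const DWORD zero = 0;
    return rc < zero;
}

// Compares against the documented failure value instead.
bool failed_by_documented_value(DWORD rc) {
    return rc == kSuspendFailed;
}
```

In the real code the equivalent change would be comparing the result of SuspendThread (and ResumeThread, which uses the same convention) against `(DWORD)-1`.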

How can it be the case that I can suspend a thread, but can't access its context?

This blog post, http://www.dcl.hpi.uni-potsdam.de/research/WRK/2009/01/what-does-suspendthread-really-do/, suggests that SuspendThread, when it returns, may have started the suspension of the other thread, but that thread hasn't yet suspended. In this case, I can kind of see how GetThreadContext would be problematic, but this seems like a stupid way to define SuspendThread. (How would the caller of SuspendThread know when the target thread was actually suspended?)

EDIT: I lied. I said this was for Windows.

Well, the strange truth is that I don't see this behavior under Windows XP 64 (at least not in the last week, and I don't really know what happened before that)... but we have been testing this Windows application under Wine on Ubuntu 10.x. The Wine source for the guts of GetThreadContext contains an Access Denied return response on line 819 when an attempt to grab the thread state fails for some reason. I'm guessing, but it appears that Wine GetThreadStatus believes that a thread just might not be accessible, even after repeated attempts. Why that would be true after a SuspendThread is beyond me, but there's the code. Thoughts?

EDIT2: I lied again. I said we only saw the behavior on Wine. Nope... we have now found a Vista Ultimate system that seems to produce the same error (again, rarely). So, it appears that Wine and Windows agree on an obscure case. It also appears that merely enabling the Sysinternals Process Monitor program aggravates the situation and causes the problem to appear on Windows XP 64; I suspect a Heisenbug. (Process Monitor doesn't even exist on the Wine-tasting (:-) machine or the XP 64 system I use for development.)

What on earth is it?

EDIT3: Sept 15 2010. I've added careful checking of the error return status, without otherwise disturbing the code, for SuspendThread, ResumeThread, and GetThreadContext. I haven't seen any hint of this behavior on Windows systems since I did that. Haven't gotten back to the Wine experiment.

+1  A: 

There are some particular problems surrounding suspending a thread that owns a CriticalSection. I can't find a good reference to it now, but there is one mention of it on Raymond Chen's blog and another mention on Chris Brumme's blog. Basically, if you are unlucky enough to call SuspendThread while the thread is accessing an OS lock (e.g., heap lock, DllMain lock, etc.), then really strange things can happen. I would assume that this is the case that you are running into extremely rarely.

Does retrying the call to GetThreadContext work after a processor yield like Sleep(0)?
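A rough sketch of what such a retry could look like, assuming it is the suspending thread that retries and yields. The helper name `retry_with_yield` and its signature are made up for illustration, and `std::this_thread::yield()` is used as a portable stand-in for `Sleep(0)` so the sketch compiles off Windows.

```cpp
#include <functional>
#include <thread>

// Hypothetical helper: calls `op` up to `max_attempts` times and yields the
// processor after each failed attempt. Returns the 1-based number of the
// attempt that succeeded, or 0 if no attempt succeeded.
int retry_with_yield(const std::function<bool()>& op, int max_attempts) {
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        if (op()) {
            return attempt;
        }
        // Portable stand-in for Sleep(0): give other runnable threads a turn.
        std::this_thread::yield();
    }
    return 0;
}
```

In the code from the question, `op` would be a lambda wrapping the GetThreadContext call on the already suspended thread. A return value greater than 1 would show that the failure is transient, which is useful diagnostic information even if the retry is not kept as a permanent fix.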

D.Shawley
AFAIK, it doesn't matter if a thread owns a CriticalSection. If you suspend it, you suspend it owning the CriticalSection; that's no worse than suspending owning another resource (e.g., a block of dynamically allocated storage) *unless* the suspender attempts to use that resource. We aren't doing that.
Ira Baxter
... Which thread are you suggesting is doing the Sleep(0), the suspender or the suspendee? I can't see the point of the suspender doing Sleep(0), and the suspender can't make the suspendee do a Sleep(0) at his convenience, so I don't understand what is being suggested.
Ira Baxter
I looked at Chen's blog. Yes, if the suspender uses the same resource (including dynamic allocation) one can get deadlock. Our inspection thread doesn't do that (2 lines of code between SuspendThread and GetThreadContext, to set Context to what we want; see the example code added to my question). And we aren't seeing deadlock; rather, we are seeing GetThreadContext produce error 5, which makes no sense.
Ira Baxter
After your latest comments, my guess is that one of the IO calls in the guts of `send_request` or `wait_reply` in _wine/dlls/ntdll/server.c_ is failing. Use a tracing tool like `strace` to trace the system calls and see which one is failing and why.
D.Shawley
@Shawley: Hmm. strace might give some insight. I'm pretty worried that changing the timing of the calls will change the behaviour, since the problem appears to be related to stopping threads, but it appears the experiment might be relatively easy. I'll look at giving it a try.
Ira Baxter
@Shawley: We struggled with strace and Wine. First, it produces an immense amount of output (100s of MB) just starting up Wine, but that's just an annoyance; it doesn't appear to produce any output from our Wine-emulated program. We're guessing that's because Wine forks a subprocess. We attempted to use -f with strace (to trace the fork) but we never see the start of our program's execution; Wine just hangs. So strace is unable to show us what is happening. (Wine will normally run our emulated program just fine, modulo the occasional Access Denied response I've described.) This is under Ubuntu 10.
Ira Baxter
A: 

Let me quote from Richter and Nasarre's "Windows via C/C++" (5th edition), which may shed some light:

DWORD SuspendThread(HANDLE hThread);

Any thread can call this function to suspend another thread (as long as you have the thread's handle). It goes without saying (but I'll say it anyway) that a thread can suspend itself but cannot resume itself. Like ResumeThread, SuspendThread returns the thread's previous suspend count. A thread can be suspended as many as MAXIMUM_SUSPEND_COUNT times (defined as 127 in WinNT.h). Note that SuspendThread is asynchronous with respect to kernel-mode execution, but user-mode execution does not occur until the thread is resumed.

In real life, an application must be careful when it calls SuspendThread because you have no idea what the thread might be doing when you attempt to suspend it. If the thread is attempting to allocate memory from a heap, for example, the thread will have a lock on the heap. As other threads attempt to access the heap, their execution will be halted until the first thread is resumed. SuspendThread is safe only if you know exactly what the target thread is (or might be) doing and you take extreme measures to avoid problems or deadlocks caused by suspending the thread.

...

Windows actually lets you look inside a thread's kernel object and grab its current set of CPU registers. To do this, you simply call GetThreadContext:

BOOL GetThreadContext( HANDLE hThread, PCONTEXT pContext);

To call this function, just allocate a CONTEXT structure, initialize some flags (the structure's ContextFlags member) indicating which registers you want to get back, and pass the address of the structure to GetThreadContext. The function then fills in the members you've requested.

You should call SuspendThread before calling GetThreadContext; otherwise, the thread might be scheduled and the thread's context might be different from what you get back. A thread actually has two contexts: user mode and kernel mode. GetThreadContext can return only the user-mode context of a thread. If you call SuspendThread to stop a thread but that thread is currently executing in kernel mode, its user-mode context is stable even though SuspendThread hasn't actually suspended the thread yet. But the thread cannot execute any more user-mode code until it is resumed, so you can safely consider the thread suspended and GetThreadContext will work.

My guess is that GetThreadContext may fail if you just called SuspendThread, while the thread is in kernel mode, and the kernel is locking the thread context block at this time.

Maybe on multicore systems, one core is still handling the kernel-mode execution of the thread whose user-mode execution was just suspended, and it keeps the thread's CONTEXT structure locked at exactly the moment another core calls GetThreadContext.

Since this behaviour is not documented, I suggest contacting Microsoft.

Lior Kogan
A: 

Maybe a thread safety issue. Are you sure that the hComputeThread struct isn't changing out from under you? Maybe the thread was exiting when you called suspend? This may cause suspend to succeed, but by the time you call get context it is gone and the handle is invalid.

Mike
None of the answers seems to pan out. I'm handing you the points for at least a plausible explanation. I don't actually believe I have this problem, but I'm adding a "GetHandleProperties" check to see if GetHandle complains.
Ira Baxter