Hi all,

I'm seeing intermittent issues with named events when processes run in different user contexts: WaitForSingleObject (and WaitForMultipleObjects too) on such an event handle fails with WAIT_FAILED, and GetLastError returns 6 (ERROR_INVALID_HANDLE, "Invalid handle value").

We have an application that schedules tasks on Windows machines under user accounts, and the issue shows up after some tasks have completed.

The service part of the application (JobManager) starts an executable (JobLeader) under the user account (via CreateProcessAsUser) to run the user task, and then waits for a named event to be signaled. The manual-reset named event is created by JobLeader in the "Global\" namespace and signaled when the user task completes.

JobManager waits in a loop, calling WFMO (WaitForMultipleObjects) with a 10-second timeout, to see whether the named event or the JobLeader process handle has been signaled.

Periodically the named event handle, opened by JobManager through the OpenEvent API call, causes WFMO to return WAIT_FAILED with error code 6, "Invalid handle value" (WFSO is then called on each handle to identify which one is broken). After reopening the event, the error may go away, or it may not: WFMO may again return WAIT_FAILED because of an invalid handle value.
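For reference, here is a minimal sketch of how one polling step of such a loop could decode the WFMO result. It is not the actual JobManager code: `decode_wait`, `poll_once`, and the `WAITRES_*` names are hypothetical, and the handle order (event first, process second) is an assumption. The constants in the non-Windows branch are stand-ins that mirror the SDK values so the decoding logic can be compiled anywhere.

```c
#include <stdint.h>
#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
#else
/* Stand-ins so the decoding logic also compiles off Windows.
   The values mirror the Windows SDK headers. */
typedef uint32_t DWORD;
#define WAIT_OBJECT_0    0x00000000u
#define WAIT_ABANDONED_0 0x00000080u
#define WAIT_TIMEOUT     0x00000102u
#define WAIT_FAILED      0xFFFFFFFFu
#endif

/* Hypothetical result type, not from the original application. */
typedef enum {
    WAITRES_SIGNALED,   /* *index tells which handle fired        */
    WAITRES_TIMED_OUT,  /* nothing signaled within the timeout    */
    WAITRES_FAILED,     /* WAIT_FAILED: inspect GetLastError()    */
    WAITRES_OTHER       /* e.g. WAIT_ABANDONED_0 + n for mutexes  */
} wait_result;

/* Classify a WaitForMultipleObjects return value for `count` handles. */
static wait_result decode_wait(DWORD rc, DWORD count, DWORD *index)
{
    if (rc == WAIT_FAILED)
        return WAITRES_FAILED;
    if (rc == WAIT_TIMEOUT)
        return WAITRES_TIMED_OUT;
    if (rc < WAIT_OBJECT_0 + count) {
        *index = rc - WAIT_OBJECT_0;
        return WAITRES_SIGNALED;
    }
    return WAITRES_OTHER;
}

#ifdef _WIN32
/* One polling step. Assumed layout: handles[0] is the named event,
   handles[1] is the JobLeader process handle. */
static wait_result poll_once(const HANDLE handles[2], DWORD *index)
{
    DWORD rc = WaitForMultipleObjects(2, handles, FALSE, 10000);
    /* Read the error code before any other API call can overwrite it. */
    DWORD err = (rc == WAIT_FAILED) ? GetLastError() : 0;
    wait_result r = decode_wait(rc, 2, index);
    if (r == WAITRES_FAILED)
        fprintf(stderr, "WFMO failed, GetLastError=%lu\n", (unsigned long)err);
    return r;
}
#endif
```

Capturing GetLastError immediately after the failed wait matters here, because a logging call made in between can reset the thread's last-error value and hide the real code.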

Interestingly, it may run a few dozen tasks without this error, and then several tasks in a row hit it. The tasks used for testing are identical: just a cmd.exe script that dumps the environment.

Does anyone have ideas about this?

Regards, Alex

+2  A: 

Do you create the event in your JobManager and then open it in the 'JobLeader'? If not, how do you communicate the event handle (and/or name) between the two processes?

My gut tells me it's a race condition...

Len Holgate
Hi Len, as I stated above, "Manual reset named event is created by JobLeader in the "Global\" namespace", using CreateEvent(). JobLeader controls the user task and notifies JobManager when it has finished by signaling this named event via SetEvent(). JobManager specifically uses OpenEvent to open this named event. What's more, this is done in a loop for up to 150 seconds, retrying every 10 seconds if the previous attempt failed (in case the event has not yet been created by JobLeader, though that has never happened so far). Yep... My gut tells me the same... Especially because it's intermittent...
Alex Wakizashi
In addition, there is one more communication channel between JobLeader and JobManager: RPC through a local socket. To establish this connection I also use a named event, set when the RPC server is started by JobLeader, so that JobManager can connect to it and pass the task info (environment, command line and arguments, working dir, etc.). The event name is formed from the unique (numeric) task ID and the JobLeader process ID. BTW, I have never seen problems with the named event used for signaling RPC server readiness...
Alex Wakizashi
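To make the naming scheme concrete, here is a small sketch of how such a name could be built. The thread only says the name combines the numeric task ID and the JobLeader process ID; the `TaskDone_` prefix, the field order, and the function name `build_event_name` are assumptions for illustration.

```c
#include <stddef.h>
#include <stdio.h>

/* Build a per-task event name in the Global\ namespace.
   Hypothetical format: "Global\TaskDone_<taskId>_<leaderPid>".
   Returns 0 on success, -1 if the name does not fit in `buf`. */
static int build_event_name(char *buf, size_t size,
                            unsigned long task_id, unsigned long leader_pid)
{
    int n = snprintf(buf, size, "Global\\TaskDone_%lu_%lu",
                     task_id, leader_pid);
    /* snprintf returns the length it wanted to write; a value >= size
       means the name was truncated and must not be used. */
    return (n > 0 && (size_t)n < size) ? 0 : -1;
}
```

Rejecting a truncated name is worth the extra check: two different tasks whose names are cut at the same point would otherwise end up sharing one event.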
Why don't you create the event in the JobManager and then open it in the spawned process? There's no possibility of a race condition then. Since your failure is at the Wait() stage, are you checking for errors after the Open? Are you saying you Open the event OK and then fail at the Wait()? Could it be security permissions causing the Open to fail?
Len Holgate
JobManager runs under the system account, with higher privileges, so in theory it should be able to open any sync object that is created with default priorities. Especially because it's opened only for reading (getting the state). And yes, of course, as I wrote in the first comment, it checks OpenEvent for errors for up to 150 seconds (15 attempts to open it), but usually the event is opened on the first try. Thanks a lot for the idea about creating the event in JobManager... Will try it :)
Alex Wakizashi
It's just that your question says that it's WaitForMultipleObjects() that's failing with invalid handle... How is it getting an invalid handle if the Open() works and you don't close the object before or during the wait?? Could you (and the spawned process) be closing the object during the wait??
Len Holgate
That's exactly the question that is blowing my mind! :) But no, the event is not closed. The event is closed only after JobLeader gets the termination command over RPC, not earlier. And the logs confirm that...
Alex Wakizashi
Hi Len! I have found one more case, also related to race conditions... Sometimes, while the user task is running (and the system is under high load with multiple parallel processes), JobLeader disappears without a trace. As a result, the task is lost (the user task process continues to run, but with nothing controlling it). So I'm going to rewrite that part to make JobLeader restartable and able to regain control over the user task. And thanks for the hints!
Alex Wakizashi
You should be able to redesign to avoid the races; in general it's always best to create the resources that the spawned process will use in the spawning process and then connect to them in the spawned process (with checks to ensure you NEVER create new resources in the spawned process). This works well for me. By the way, I assume you're using the Win32 Job API to manage your spawned processes? If not, you should be, it works really well and is great for this kind of thing.
Len Holgate
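A rough sketch of the "create in the spawner, only open in the spawned process" pattern described above, under stated assumptions: the function names and the `CREATE_RES_*` values are hypothetical, and passing NULL for the security attributes is a simplification. A service running as SYSTEM would probably need an explicit security descriptor so that the process running under the user account is allowed to open and set the event.

```c
#ifdef _WIN32
#include <windows.h>
#else
/* Stand-in so the classification logic also compiles off Windows.
   The value mirrors winerror.h. */
#define ERROR_ALREADY_EXISTS 183L
#endif

/* Hypothetical result type, not from the original application. */
typedef enum {
    CREATE_RES_FRESH,            /* we created a brand-new object      */
    CREATE_RES_ALREADY_EXISTED,  /* someone else created it first      */
    CREATE_RES_FAILED            /* CreateEvent returned NULL          */
} create_result;

/* CreateEvent succeeds even when the name already exists; the only
   hint is GetLastError() == ERROR_ALREADY_EXISTS. */
static create_result classify_create(int handle_is_null,
                                     unsigned long last_error)
{
    if (handle_is_null)
        return CREATE_RES_FAILED;
    return last_error == (unsigned long)ERROR_ALREADY_EXISTS
               ? CREATE_RES_ALREADY_EXISTED
               : CREATE_RES_FRESH;
}

#ifdef _WIN32
/* Spawner side (JobManager): create the event BEFORE starting the
   child, and refuse to reuse an object that already exists. */
static HANDLE create_completion_event(const char *name)
{
    SetLastError(0);
    /* NULL = default security (assumption, see the note above);
       TRUE = manual reset, FALSE = initially non-signaled. */
    HANDLE h = CreateEventA(NULL, TRUE, FALSE, name);
    DWORD err = GetLastError();
    if (classify_create(h == NULL, err) != CREATE_RES_FRESH) {
        if (h != NULL)
            CloseHandle(h);
        return NULL;
    }
    return h;
}

/* Spawned side (JobLeader): OpenEvent never creates anything, so a
   missing event shows up as an error instead of a silent second object. */
static HANDLE open_completion_event(const char *name)
{
    return OpenEventA(EVENT_MODIFY_STATE, FALSE, name);
}
#endif
```

With this split the event exists before the child starts, so the 150-second open-retry loop on the waiting side is no longer needed. One consequence is that the name can no longer include the child's process ID, since that is not known until after the spawn.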