I wrote a multi-threaded Windows application with two threads:
A – a Windows form that handles user interaction and processes the data from B.
B – occasionally generates data and passes it to A.

A thread-safe queue is used to pass the data from thread B to A. The enqueue and dequeue functions are guarded by a Windows critical section object.

If the queue is empty when the enqueue function is called, the function will use PostMessage to tell A that there is data in the queue. The function checks to make sure the call to PostMessage is executed successfully and repeatedly calls PostMessage if it is not successful (PostMessage has yet to fail).

This worked well for quite some time until one specific computer started to lose the occasional message. By "lose" I mean that PostMessage returns successfully in B but A never receives the message. This causes the software to appear frozen.

I have already come up with a couple of acceptable workarounds. I am interested in knowing why Windows is losing these messages and why this is only happening on the one computer.

Here are the relevant portions of the code.

// Only called by B
procedure TSharedQueue.Enqueue(AItem: TSQItem);
var
  B: boolean;
begin
  EnterCriticalSection(FQueueLock);
  if FCount > 0 then
    begin
      FLast.FNext := AItem;
      FLast := AItem;
    end
  else
    begin
      FFirst := AItem;
      FLast := AItem;
    end;

  if (FCount = 0) or (FCount mod 10 = 0) then // just in case a message is lost
    repeat
      B := PostMessage(FConsumer, SQ_HAS_DATA, 0, 0);
      if not B then
        Sleep(1000); // this line of code has never been reached
    until B;

  Inc(FCount);
  LeaveCriticalSection(FQueueLock);
end;

// Only called by A 
function TSharedQueue.Dequeue: TSQItem;
begin
  EnterCriticalSection(FQueueLock);
  if FCount > 0 then
    begin
      Result := FFirst;
      FFirst := FFirst.FNext;
      Result.FNext := nil;
      Dec(FCount);
    end
  else
    Result := nil;
  LeaveCriticalSection(FQueueLock);
end;

// procedure called when SQ_HAS_DATA is received
procedure TfrmMonitor.SQHasData(var AMessage: TMessage);
var
  Item: TSQItem;
begin
  while FMessageQueue.Count > 0 do
    begin
      Item := FMessageQueue.Dequeue;
      // use the Item somehow
    end;
end;
A: 

Could there be a second instance unknowingly running and eating the messages, marking them as handled?

scottm
+1  A: 

If the queue is empty when the enqueue function is called, the function will use PostMessage to tell A that there is data in the queue.

Are you locking the message queue before checking the queue size and issuing the PostMessage? You may be experiencing a race condition where you check the queue and find it non-empty when in fact A is processing the very last message and is about to go idle.

To see if you're in fact experiencing a race condition and not a problem with PostMessage, you could switch to using an event. The worker thread (A) would wait on the event instead of waiting for a message. B would simply set that event instead of posting a message.

This worked well for quite some time until one specific computer started to lose the occasional message.

By any chance, does this specific computer have a different number of CPUs or cores than the others where you see no problem? Sometimes when you switch from a single-CPU machine to a machine with more than one physical CPU/core, new race conditions or deadlocks may arise.

Ates Goral
Your answer doesn't make sense to me: the message queue, to which the PostMessage API posts messages, is controlled by the O/S (not by the application) and cannot be "locked" by the application.
ChrisW
@ChrisW: This is the statement by the OP: "A thread safe queue is used to pass the data from thread B to A. The enqueue and dequeue functions are guarded using a windows critical section objects."
Ates Goral
From what I understand, a separate thread-safe queue is being used to queue up messages, and a Windows message sent using PostMessage is only used as a signal for the thread to wake up and process the queued messages.
Ates Goral
+2  A: 

Is FCount also protected by FQueueLock? If not, then your problem lies with FCount being incremented after the posted message is already processed.

Here's what might be happening:

  1. B enters critical section
  2. B calls PostMessage
  3. A receives the message but doesn't do anything since FCount is 0
  4. B increments FCount
  5. B leaves critical section
  6. A sits there like a duck

A quick remedy would be to increment FCount before calling PostMessage.
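
To make that remedy concrete, here is a minimal sketch of a reordered Enqueue. It reuses the field names from the question, but the surrounding declarations (the TSQItem and TSharedQueue layouts, the constructor, and the value of SQ_HAS_DATA) are my assumptions, since the question does not show them. It also moves the PostMessage call outside the critical section and drops the "mod 10" safety net; both are my own choices, not something the question requires.

```pascal
unit SharedQueueSketch;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

interface

uses
  Windows, Messages;

const
  SQ_HAS_DATA = WM_USER + 1; // assumed value; the question does not show it

type
  // Assumed layout; only FNext is visible in the question.
  TSQItem = class
  public
    FNext: TSQItem;
  end;

  TSharedQueue = class
  private
    FQueueLock: TRTLCriticalSection;
    FFirst, FLast: TSQItem;
    FCount: Integer;
    FConsumer: HWND;
  public
    constructor Create(AConsumer: HWND);
    destructor Destroy; override;
    procedure Enqueue(AItem: TSQItem);
  end;

implementation

constructor TSharedQueue.Create(AConsumer: HWND);
begin
  inherited Create;
  InitializeCriticalSection(FQueueLock);
  FConsumer := AConsumer;
end;

destructor TSharedQueue.Destroy;
begin
  DeleteCriticalSection(FQueueLock);
  inherited Destroy;
end;

// Only called by B
procedure TSharedQueue.Enqueue(AItem: TSQItem);
var
  WasEmpty: Boolean;
begin
  EnterCriticalSection(FQueueLock);
  try
    WasEmpty := FCount = 0;
    if WasEmpty then
      FFirst := AItem
    else
      FLast.FNext := AItem;
    FLast := AItem;
    Inc(FCount); // the count is updated before A can be signalled
  finally
    LeaveCriticalSection(FQueueLock);
  end;

  // Signal A only after the item and the new count are in place.
  if WasEmpty then
    while not PostMessage(FConsumer, SQ_HAS_DATA, 0, 0) do
      Sleep(1000);
end;

end.
```

With this ordering, by the time A handles SQ_HAS_DATA the count already reflects the new item, so the handler's "while Count > 0" loop has something to dequeue.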

Keep in mind that things can happen quicker than one would expect (i.e. the message posted with PostMessage being caught and processed by another thread before you have a chance to increment FCount a few lines later), especially when you're in a true multi-threaded environment (multiple CPUs). That's why I asked earlier if the "problem machine" had multiple CPUs/cores.

An easy way to troubleshoot problems like these is to scaffold the code with additional logging that records every time you enter a method, enter or leave a critical section, and so on. Then you can analyze the log to see the true order of events.

On a separate note, a nice little optimization that can be done in a producer/consumer scenario like this is to use two queues instead of one. When the consumer wakes up to process the full queue, you swap the full queue with an empty one and just lock/process the full queue while the new empty queue can be populated without the two threads trying to lock each other's queues. You'd still need some locking in the swapping of the two queues though.
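
For a linked-list queue like the one in the question, the swap can be approximated by detaching the whole list under the lock and handing it to the consumer. The following is only a sketch under assumptions: TSwapQueue, TakeAll and the Value payload are hypothetical names that do not appear in the question, and I use TCriticalSection from SyncObjs instead of the raw Windows calls to keep the example short.

```pascal
program SwapQueueSketch;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SyncObjs;

type
  TSQItem = class
  public
    FNext: TSQItem;
    Value: Integer; // hypothetical payload, not in the question
  end;

  // Hypothetical queue type illustrating the swap idea
  TSwapQueue = class
  private
    FLock: TCriticalSection;
    FFirst, FLast: TSQItem;
  public
    constructor Create;
    destructor Destroy; override;
    procedure Enqueue(AItem: TSQItem); // producer side
    function TakeAll: TSQItem;         // consumer side
  end;

constructor TSwapQueue.Create;
begin
  inherited Create;
  FLock := TCriticalSection.Create;
end;

destructor TSwapQueue.Destroy;
begin
  FLock.Free;
  inherited Destroy;
end;

procedure TSwapQueue.Enqueue(AItem: TSQItem);
begin
  FLock.Acquire;
  try
    AItem.FNext := nil;
    if FFirst = nil then
      FFirst := AItem
    else
      FLast.FNext := AItem;
    FLast := AItem;
  finally
    FLock.Release;
  end;
end;

function TSwapQueue.TakeAll: TSQItem;
begin
  FLock.Acquire;
  try
    Result := FFirst; // hand the whole filled list to the caller
    FFirst := nil;    // the producer continues with an empty list
    FLast := nil;
  finally
    FLock.Release;
  end;
end;

var
  Queue: TSwapQueue;
  Item, NextItem: TSQItem;
  I: Integer;
begin
  Queue := TSwapQueue.Create;
  try
    for I := 1 to 3 do
    begin
      Item := TSQItem.Create;
      Item.Value := I;
      Queue.Enqueue(Item);
    end;

    // One short lock to take everything, then process without the lock.
    Item := Queue.TakeAll;
    while Item <> nil do
    begin
      NextItem := Item.FNext;
      WriteLn(Item.Value);
      Item.Free;
      Item := NextItem;
    end;
  finally
    Queue.Free;
  end;
end.
```

The consumer holds the lock only for the few assignments in TakeAll, so the producer is never blocked while items are being processed.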

Ates Goral
Dude. I learned something here. Thanks!
Warren P