I have a thread which produces data in the form of a simple object (a record). The thread may produce a thousand records for each one that successfully passes a filter and is actually enqueued. Once the object is enqueued it is read-only.

I have one lock, which I acquire once the record has passed the filter, and I add the item to the back of the producer_queue.

On the consumer thread, I acquire the lock, confirm that the producer_queue is not empty, set consumer_queue to equal producer_queue, create a new (empty) queue, and assign it to producer_queue. Without any further locking I process consumer_queue until it's empty, then repeat.

Everything works beautifully on most machines, but on one particular dual quad-core server, roughly once in 500k iterations, I read an object out of consumer_queue that is not fully initialized. The condition is so fleeting that when I dump the object after detecting it, the fields are correct 90% of the time.

So my question is this: how can I ensure that the writes to the object are flushed to main memory when the queue is swapped?

Edit:

On the producer thread: (producer_queue above is m_fillingQueue; consumer_queue above is m_drainingQueue)

private void FillRecordQueue() {
  while (!m_done) {
    int count;
    lock (m_swapLock) {
      count = m_fillingQueue.Count;
    }
    if (count > 5000) {
      Thread.Sleep(60);
    } else {
      DataRecord rec = GetNextRecord();
      if (rec == null) break;
      lock (m_swapLock) {
        m_fillingQueue.AddLast(rec);
      }
    }
  }
}

In the consumer thread:

private DataRecord Next(bool remove) {
  bool drained = false;
  while (!drained) {
    if (m_drainingQueue.Count > 0) {
      DataRecord rec = m_drainingQueue.First.Value;
      if (remove) m_drainingQueue.RemoveFirst();
      if (rec.Time < FIRST_VALID_TIME) {
        throw new InvalidOperationException("Detected invalid timestamp in Next(): " + rec.Time + " from record " + rec);
      }
      return rec;
    } else {
      lock (m_swapLock) {
        m_drainingQueue = m_fillingQueue;
        m_fillingQueue = new LinkedList<DataRecord>();
        if (m_drainingQueue.Count == 0) drained = true;
      }
    }
  }
  return null;
}

The producer is rate-limited (it sleeps whenever the filling queue holds more than 5000 records), so it can't get ahead of the consumer.

The behavior I see is that the Time field sometimes reads as DateTime.MinValue; by the time I construct the string to throw the exception, however, it's perfectly fine.

+2  A: 

Have you tried the obvious: is the microcode update applied on the fancy 8-core box (via a BIOS update)? Did you run Windows Update to get the latest processor driver?

At first glance, it looks like you're locking your containers. So I am recommending the systems approach, as it sounds like you're not seeing this issue on a good ol' dual-core box.

GregC
A: 

Assuming these are in fact the only methods that interact with the m_fillingQueue variable, and that DataRecord cannot be changed after GetNextRecord() creates it (read-only properties hopefully?), then the code at least on the face of it appears to be correct.

In which case I suggest that GregC's answer would be the first thing to check: make sure the failing machine is fully updated (OS / drivers / .NET Framework), because the lock statement should involve all the required memory barriers to ensure that the object referenced by rec is fully flushed out of any caches before it is added to the list.
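
As a rough sketch of why the lock ought to be sufficient: the C# 3 compiler expands a lock statement into a Monitor.Enter / Monitor.Exit pair wrapped in try/finally (newer compilers use the Monitor.Enter overload that takes a ref bool instead), and those calls carry the memory barriers. Writes made to a record before the producer releases the lock should therefore be visible to a consumer that later acquires the same lock. The Record type, SwapSketch class, and field names below are hypothetical stand-ins, not the code from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical stand-in for DataRecord: immutable once constructed.
sealed class Record {
  public readonly DateTime Time;
  public Record(DateTime time) { Time = time; }
}

static class SwapSketch {
  static readonly object s_swapLock = new object();
  static LinkedList<Record> s_filling = new LinkedList<Record>();

  // Hand-expanded form of: lock (s_swapLock) { s_filling.AddLast(rec); }
  static void Enqueue(Record rec) {
    Monitor.Enter(s_swapLock);            // acquire the lock
    try {
      s_filling.AddLast(rec);
    } finally {
      Monitor.Exit(s_swapLock);           // release: prior writes, including rec's fields, are published
    }
  }

  // Consumer side: take the filled list and leave an empty one behind.
  static LinkedList<Record> Swap() {
    lock (s_swapLock) {
      LinkedList<Record> drained = s_filling;
      s_filling = new LinkedList<Record>();
      return drained;
    }
  }

  static void Main() {
    Enqueue(new Record(DateTime.UtcNow));
    Console.WriteLine(Swap().Count);      // one record was enqueued before the swap
  }
}
```

If a record still shows up half-initialized under this pattern, the code itself is unlikely to be the culprit, which is why checking the runtime and platform comes next.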

jerryjvl
Windows Update FTW. While there were no available microcode or processor driver updates, installing .NET SP1 resolved the issue.
DavidM
Good to hear. My boss is always skeptical when I say things like that. Fix it in software, he says. One for the team.
GregC
May I ask which version of .NET was upgraded? (1.1, 2.0, 3.0, 3.5?)
GregC
From David's comment I assume it was 3.5 SP1 that solved the issue.
jerryjvl