On Linux I'm using shmget and shmat to set up a shared memory segment that one process will write to and one or more processes will read from. The data being shared is a few megabytes in size and, when updated, is completely rewritten; it's never partially updated.

I have my shared memory segment laid out as follows:

    -------------------------
    | t0 | actual data | t1 |
    -------------------------

where t0 and t1 are copies of the time when the writer began its update (with enough precision that successive updates are guaranteed to have differing times). The writer first writes t1, then copies in the data, then writes t0. The reader, on the other hand, reads t0, then the data, then t1. If the reader gets the same value for t0 and t1 then it considers the data consistent and valid; if not, it tries again.
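
To make this concrete, here is roughly the shape of the code (just a sketch: DATA_SIZE and the time source are placeholders, and the shmget/shmat setup is omitted):

    #include <stdint.h>
    #include <string.h>

    #define DATA_SIZE (2 * 1024 * 1024)   /* placeholder for "a few megabytes" */

    struct segment {
        volatile uint64_t t0;             /* written last by the writer  */
        char data[DATA_SIZE];
        volatile uint64_t t1;             /* written first by the writer */
    };
    /* volatile here only keeps the compiler from optimizing the flag
       accesses away; whether that is sufficient is exactly the question. */

    /* Writer: t1, then the data, then t0. */
    void write_update(struct segment *seg, const char *src, uint64_t now)
    {
        seg->t1 = now;
        memcpy(seg->data, src, DATA_SIZE);
        seg->t0 = now;
    }

    /* Reader: t0, then the data, then t1; retry until t0 == t1. */
    void read_snapshot(struct segment *seg, char *dst)
    {
        uint64_t a, b;
        do {
            a = seg->t0;
            memcpy(dst, seg->data, DATA_SIZE);
            b = seg->t1;
        } while (a != b);
    }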

Does this procedure ensure that if the reader thinks the data is valid then it actually is?

Do I need to worry about out-of-order execution (OOE)? If so, would the reader using memcpy to get the entire shared memory segment overcome the OOE issues on the reader side? (This assumes that memcpy performs its copy linearly, ascending through the address space. Is that assumption valid?)

+4  A: 

Modern hardware is actually anything but sequentially consistent. Thus, this is not guaranteed to work as such if you don't execute memory barriers at the appropriate spots. Barriers are needed because the architecture implements a shared memory consistency model weaker than sequential consistency. This has nothing to do with pipelining or OoO as such, but with allowing multiple processors to access the memory system efficiently in parallel. See e.g. Shared memory consistency models: A tutorial. On a uniprocessor you don't need barriers, because all the code executes sequentially on that one processor.

Also, there is no need to have two time fields; a sequence count is probably a better choice as there is no need to worry whether two updates are so close that they get the same timestamp, and updating a counter is much faster than getting the current time. Also, there is no chance that the clock moves backwards in time, which might happen e.g. when ntpd adjusts for clock drift. This last problem can be overcome on Linux by using clock_gettime(CLOCK_MONOTONIC, ...).

Another advantage of using a sequence counter instead of timestamps is that you need only one field. The writer increments the counter once before writing the data and once after the write is done, so the counter is odd while a write is in progress. The reader reads the sequence number, checks that it's even, and if so reads the data, then reads the sequence number again and compares it to the first value. If the sequence number is odd, it means a write is in progress and there is no need to read the data; if the two reads of the sequence number differ, the data was overwritten mid-read and the reader must retry.
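
A sketch of the counter scheme, using C11 atomics to express the barriers (illustrative: a single writer is assumed, and strictly speaking the C/C++ memory model wants the data accesses themselves to be atomic; the plain memcpy is shown for brevity):

    #include <stdatomic.h>
    #include <string.h>

    #define DATA_SIZE (2 * 1024 * 1024)   /* placeholder */

    struct seg {
        atomic_uint seq;                  /* odd while a write is in progress */
        char data[DATA_SIZE];
    };

    void writer_update(struct seg *s, const char *src)   /* single writer */
    {
        unsigned v = atomic_load_explicit(&s->seq, memory_order_relaxed);
        atomic_store_explicit(&s->seq, v + 1, memory_order_relaxed); /* odd: dirty  */
        atomic_thread_fence(memory_order_release);  /* pairs with reader's acquire fence */
        memcpy(s->data, src, DATA_SIZE);
        atomic_store_explicit(&s->seq, v + 2, memory_order_release); /* even: clean */
    }

    void reader_snapshot(struct seg *s, char *dst)
    {
        for (;;) {
            unsigned v1 = atomic_load_explicit(&s->seq, memory_order_acquire);
            if (v1 & 1)
                continue;                 /* write in progress: try again */
            memcpy(dst, s->data, DATA_SIZE);
            atomic_thread_fence(memory_order_acquire);  /* data loads before re-check */
            unsigned v2 = atomic_load_explicit(&s->seq, memory_order_relaxed);
            if (v1 == v2)
                return;                   /* counter unchanged: snapshot is consistent */
        }
    }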

The Linux kernel uses a locking primitive called a seqlock that does something like the above. If you're not afraid of "GPL contamination", you can google for the implementation; as such it's trivial, but the trick is getting the barriers correct.

janneb
Assuming I used the appropriate [SLM]FENCE assembly instructions before and/or after my reads and writes, wouldn't I still need 2 copies of the sequence count (seqc)? If I have just one, the writer could start by setting seqc and begin writing the data; then the OS switches processes, and the reader starts, reads seqc and most of the data (some updated, some not); then we switch back to the writer, which finishes, then back to the reader, which finishes. The reader gets the same seqc both times, but still ends up with invalid data.
Bribles
See the Wikipedia article on seqlocks; the writer increments seqc before starting the update and after the update is done. Then the reader checks whether seqc is even or odd, in addition to comparing seqc before and after reading.
janneb
@andras: Indeed, which is why I wrote "...a sequence count is probably a better choice as there is no need to worry whether two updates are so close that they get the same timestamp..."
janneb
@janneb: It seems I completely missed a sentence in between "faster than ..." and "This last problem" - which of course makes "last problem" mean something completely different. Sorry, my bad. :P
andras
@janneb: On the most widespread desktop hardware (x86/x64) the memory model is actually strong enough to support this without explicit fences. The beauty of the OP's approach with two fields is that you do not need any barriers on x86/x64. If the *compiler* does not reorder/optimize away memory accesses, this will work OOTB on x86/x64. References: 1. http://www.bluebytesoftware.com/blog/2009/06/05/AScalableReaderwriterSchemeWithOptimisticRetry.aspx 2. http://blogs.msdn.com/kangsu/archive/2007/07/16/volatile-acquire-release-memory-fences-and-vc2005.aspx
andras
@andras: The references use volatile in C# and VC++, which, as you mention yourself in your answer, has different semantics there than in ISO C++ and GCC. I believe Java has similar volatile semantics to C# and VC++, and it seems to use XCHG or MFENCE for volatile stores. However, you are correct that x86 provides a pretty strong model. I'm not 100% sure, but I think you're right that this particular algorithm doesn't need any fences on x86, since the architecture guarantees store ordering (stores from a processor are always seen in program order by other processors).
janneb
@janneb: Even in the Java world, in terms of `volatile`, a `lock add [TopOfStack],0` might only be needed if a volatile store is followed by a volatile read. The other barriers are no-ops in the JSR-133 cookbook. To quote: "...a StoreLoad is strictly necessary only for separating stores from subsequent loads of the *same* location(s) as were stored before the barrier." The behaviour you see can be explained by the JIT going by the *conservative* strategy that Doug Lea has given. It might be that your JIT emits a fence for *all* volatile stores. It does not need to, however.
andras
@janneb: ...and in this particular case, they are really unnecessary on x86/x64. (I forgot to add the link to the cookbook - not that it's hard to find with Google - but here it is: http://gee.cs.oswego.edu/dl/jmm/cookbook.html )
andras
@janneb: The only thing necessary here is making sure the *compiler* emits the plainest - and seemingly inefficient - machine code one would expect and doesn't do cunning optimizations.
andras
+1  A: 

Joe Duffy gives the exact same algorithm and calls it a "scalable reader/writer scheme with optimistic retry".

It works.
You need two sequence number fields.

You need to read and write them in opposite order.
You might need to have memory barriers in place, depending on the memory ordering guarantees of the system.

Specifically, you need read-acquire and store-release semantics for the readers and writers when they access t0 or t1 for reading and writing, respectively.

Which instructions are needed to achieve this depends on the architecture. E.g. on x86/x64, because of its relatively strong guarantees, one needs no machine-specific barriers at all in this specific case*.

* One still needs to ensure that the compiler/JIT does not mess around with loads and stores, e.g. by using volatile (which has a different meaning in Java and C# than in ISO C/C++; compilers may differ, however - e.g. with VC++ 2005 or above, volatile makes the above safe; see the "Microsoft Specific" section of its documentation). It can be done with other compilers as well on x86/x64: the emitted assembly should be inspected to make sure that accesses to t0 and t1 are not eliminated or moved around by the compiler.
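
For illustration, here is the two-field scheme with the loads and stores ordered explicitly via C11 atomics (the names are mine, and the usual caveat applies: a fully conforming version would make the data accesses atomic as well). On x86/x64 all of these compile down to plain loads and stores plus compiler barriers, which is exactly the point of the footnote above:

    #include <stdatomic.h>
    #include <string.h>

    #define DATA_SIZE (2 * 1024 * 1024)   /* placeholder */

    struct seg {
        atomic_ulong t0;
        char data[DATA_SIZE];
        atomic_ulong t1;
    };

    /* Writer: t1 first, then the data, then t0. */
    void write_update(struct seg *s, const char *src, unsigned long version)
    {
        atomic_store_explicit(&s->t1, version, memory_order_relaxed);
        atomic_thread_fence(memory_order_release);       /* t1 before the data */
        memcpy(s->data, src, DATA_SIZE);
        atomic_store_explicit(&s->t0, version, memory_order_release); /* data before t0 */
    }

    /* Reader: t0 first, then the data, then t1; retry on mismatch. */
    void read_snapshot(struct seg *s, char *dst)
    {
        unsigned long a, b;
        do {
            a = atomic_load_explicit(&s->t0, memory_order_acquire);  /* t0 before data */
            memcpy(dst, s->data, DATA_SIZE);
            atomic_thread_fence(memory_order_acquire);               /* data before t1 */
            b = atomic_load_explicit(&s->t1, memory_order_relaxed);
        } while (a != b);
    }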

As a side note, if you ever need MFENCE, `lock or [TopOfStack],0` might be a better option, depending on your needs.
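
For reference, one way to spell that out (GCC-style inline assembly; illustrative):

    /* Full fence via a locked read-modify-write of the top of the stack,
       suggested above as an alternative to an explicit MFENCE.
       x64 shown; use (%%esp) on 32-bit x86. */
    static inline void fence_via_lock_or(void)
    {
        __asm__ __volatile__("lock; orl $0, (%%rsp)" ::: "memory", "cc");
    }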

andras
Do you happen to know if this technique has a name? Also, the memory barriers are needed due to modern pipelined, overly complex architectures, right? If this code were running on an 8086 or a really simple micro-controller, then wouldn't the memory barriers be unnecessary? Finally, if the data copying is done with memcpy, and the function call is between the timestamp reads/writes, wouldn't the moving of megs of data in the memcpy sorta always make the timestamp ops sequential?
Bribles
@Bribles: Barriers are needed because the architecture implements a shared memory consistency model weaker than sequential consistency. This has nothing to do with pipelining or OoO as such, but with allowing multiple processors to access the memory system efficiently in parallel. See e.g. http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf . On a uniprocessor you don't need barriers, because all the code executes sequentially on that one processor.
janneb
@Bribles: The problems with timestamps are that 1) if performance matters, getting the current time is slower than incrementing a counter, and 2) what if the time moves backwards (e.g. ntpd moving the clock backwards to compensate for clock drift)? Well, on Linux at least, 2) can be fixed by using clock_gettime(CLOCK_MONOTONIC, ...).
janneb
@janneb: I haven't even remotely considered the possibility of using a real clock value here (even CLOCK_MONOTONIC will just slow things down without providing any added value). I think the misunderstanding stems from that. :S You are right, I should have been more precise in terminology in the first place. :P
andras