I have some immutable data structures that I would like to manage using reference counts, sharing them across threads on an SMP system.

Here's what the release code looks like:

void avocado_release(struct avocado *p)
{
    if (atomic_dec(&p->refcount) == 0) {
        free(p->pit);
        free(p->juicy_innards);
        free(p);
    }
}

Does atomic_dec need a memory barrier in it? If so, what kind of memory barrier?

Additional notes: The application must run on PowerPC and x86, so any processor-specific information is welcomed. I already know about the GCC atomic builtins. As for immutability, the refcount is the only field that changes over the duration of the object.
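For reference, here is a sketch of the release function written with C11 `<stdatomic.h>` (which postdates the GCC builtins mentioned above). The `release`-decrement-plus-`acquire`-fence pattern is one common convention for refcounting; it is an assumption here, not necessarily what any particular `atomic_dec` implementation does:

```c
#include <stdatomic.h>
#include <stdlib.h>

struct avocado {
    atomic_int refcount;
    void *pit;
    void *juicy_innards;
};

void avocado_release(struct avocado *p)
{
    /* Release semantics: all of this thread's prior writes to the object
       become visible before the count can be observed to drop. */
    if (atomic_fetch_sub_explicit(&p->refcount, 1, memory_order_release) == 1) {
        /* Acquire fence: pairs with the release decrements of other
           threads, so we see their writes before freeing the memory. */
        atomic_thread_fence(memory_order_acquire);
        free(p->pit);
        free(p->juicy_innards);
        free(p);
    }
}
```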

+2  A: 

Are you intending to implement your own atomic_dec or are you just wondering whether a system-supplied function will behave as you want?

As a general rule, system-supplied atomic increment/decrement facilities will apply whatever memory barriers are required to just do the right thing. You generally don't have to worry about memory barriers unless you are doing something wacky like implementing your own lock-free data structures or an STM library.

Marcelo Cantos
I want to know whether memory barriers are necessary in this case, and why.
Dietrich Epp
+1 "something" will be required to synchronise access to the refcount field. Whether that "something" is literally a memory barrier, or another similar manipulation of caches, requires trawling through CPU specifications and/or checking the emitted code. It needn't be a full cache flush, perhaps the CPU invalidates just the single cache line that's used. The compiler and CPU each have to ensure instructions aren't re-ordered across the decrement, but the conditional based on the result of the decrement pretty much ensures that anyway.
Steve Jessop
@Dietrich: in this case, no, because the subsequent operations are conditional on the outcome of the decrement, and there is thus no possibility of the compiler reordering things in a problematic way. Besides, the nature of a refcount is such that, when the count reaches zero, only one thread can possibly have access to the object in question (absent bugs, that is).
Marcelo Cantos
AFAIK the memory barrier is only for instruction sequencing and should not have to involve the cache at all.
Per Ekman
@Per: yes, without knowing why the questioner wants to know about memory barriers, it's not clear whether the cache behaviour is relevant, or whether he's just asking about the possibility of micro-code instruction re-ordering. I've never quite figured out when you need to prevent instruction re-ordering but don't care about cache freshness. But I'm prepared to believe it happens :-)
Steve Jessop
@Steve: I only mention it because people seem to worry unduly about the cache when discussing multithreading correctness. Modern multiprocessors like the x86 systems will take care of it all in hardware. In a cache-coherent system you only need to worry about cache flushing if you're hacking the kernel or a driver for a device doing DMA transfers. It's important for performance of course, but not for correctness.
Per Ekman
Sure: do you happen to know whether multicore PowerPC necessarily has coherent cache? But you're right, atomic is atomic, and whether it's implemented with explicit cache invalidation or coherent cache, or whatever, rarely affects application code. There are things you can do assuming coherent cache: whether you should or not is questionable.
Steve Jessop
+1  A: 

On x86, it will turn into a lock prefixed assembly instruction, like LOCK XADD.
Being a single instruction, it is non-interruptible. As an added "feature", the lock prefix results in a full memory barrier:

"...locked operations serialize all outstanding load and store operations (that is, wait for them to complete)." ..."Locked operations are atomic with respect to all other memory operations and all externally visible events. Only instruction fetch and page table accesses can pass locked instructions. Locked instructions can be used to synchronize data written by one processor and read by another processor." - Intel® 64 and IA-32 Architectures Software Developer’s Manual, Chapter 8.1.2.

A memory barrier is in fact implemented as a dummy LOCK OR or LOCK AND in both the .NET and the Java JIT on x86/x64.
So you have a full fence on x86 as an added bonus, whether you like it or not. :)
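To illustrate, here is a minimal sketch using the legacy GCC `__sync` builtins the questioner mentioned (the helper name `refcount_dec` is hypothetical). `__sync_sub_and_fetch` implies a full barrier, and on x86 GCC typically lowers it to a single `lock`-prefixed instruction:

```c
/* refcount_dec is a hypothetical wrapper; __sync_sub_and_fetch is the
   GCC builtin. It returns the new value and acts as a full barrier. */
static int refcount_dec(int *count)
{
    return __sync_sub_and_fetch(count, 1);
}
```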

On PPC, it is different. An LL/SC pair - lwarx & stwcx - with a subtraction inside can be used to load the memory operand into a register, subtract one, then either write it back if there was no other store to the target location, or retry the whole loop if there was. An LL/SC pair can be interrupted, and it does not imply an automatic full fence.
This does not however compromise the atomicity of the counter in any way.
It just means that in the x86 case, you happen to get a fence as well, "for free".
On PPC, one can insert a full fence by emitting a (lw)sync instruction.
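The LL/SC retry loop can be expressed portably as a compare-and-swap loop in C11 (the function name is an assumption); on PowerPC a compiler lowers `atomic_compare_exchange_weak` to roughly the lwarx/stwcx. sequence described above, with the weak variant allowed to fail spuriously, hence the loop:

```c
#include <stdatomic.h>

/* Decrement written as a CAS loop - roughly what an lwarx/stwcx.
   loop does on PowerPC. Returns the new value of the counter. */
int refcount_dec(atomic_int *count)
{
    int old = atomic_load_explicit(count, memory_order_relaxed);
    /* If the store-conditional "fails" (another store hit the location),
       CAS reloads the current value into `old` and we retry. */
    while (!atomic_compare_exchange_weak_explicit(
               count, &old, old - 1,
               memory_order_release, memory_order_relaxed)) {
        /* retry */
    }
    return old - 1;
}
```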

All in all, explicit memory barriers are not necessary for the atomic counter to work properly.

andras