views:

438

answers:

6

I'm looking for an equivalent of LWARX and STWCX (as found on the PowerPC processors) or a way to implement similar functionality on the x86 platform. Also, where would be the best place to find out about such things (i.e. good articles/web sites/forums for lock/wait-free programing).


Edit
I think I might need to give more details as it is being assumed that I'm just looking for a CAS (compare and swap) operation. What I'm trying to do is implement a lock-free reference counting system with smart pointers that can be accessed and changed by multiple threads. I basically need a way to implement the following function on an x86 processor.

int* IncrementAndRetrieve(int **ptr)
{
  int val;
  int *pval;
  do
  {
    // fetch the pointer to the value
    pval = *ptr;

    // if its NULL, then just return NULL, the smart pointer
    // will then become NULL as well
    if(pval == NULL)
      return NULL;

    // Grab the reference count
    val = lwarx(pval);

    // make sure the pointer we grabbed the value from
    // is still the same one referred to by  'ptr'
    if(pval != *ptr)
      continue;

    // Increment the reference count via 'stwcx' if any other threads
    // have done anything that could potentially break then it should
    // fail and try again
  } while(!stwcx(pval, val + 1));
  return pval;
}

I really need something that mimics LWARX and STWCX fairly accurately to pull this off (I can't figure out a way to do this with the CompareExchange, swap or add functions I've so far found for the x86).

Thanks

+1  A: 

x86 does not directly support "optimistic concurrency" like PPC does -- rather, x86's support for concurrency is based on a "lock prefix", see here. (Some so-called "atomic" instructions such as XCHG actually get their atomicity by intrinsically asserting the LOCK prefix, whether the assembly code programmer has actually coded it or not). It's not exactly "bomb-proof", to put it diplomatically (indeed, it's rather accident-prone, I would say;-).

Alex Martelli
+1  A: 

You're probably looking for the cmpxchg family of instructions.

You'll need to precede these with a lock instruction to get equivalent behaviour.

Have a look here for a quick overview of what's available.

You'll likely end up with something similar to this:

mov ecx,dword ptr [esp+4]
mov edx,dword ptr [esp+8]
mov eax,dword ptr [esp+12]
lock cmpxchg dword ptr [ecx],edx
ret 12

You should read this paper...

Edit

In response to the updated question, are you looking to do something like the Boost shared_ptr? If so, have a look at that code and the files in that directory - they'll definitely get you started.

Michael van der Westhuizen
Those 2 links are quite good (actually stumbled across those same 2 pages a few days ago), but unfortunately not what I'm looking for (I updated the question to better reflect this)
Grant Peters
+2  A: 

As Michael mentioned, what you're probably looking for is the cmpxchg instruction.

It's important to point out though that the PPC method of accomplishing this is known as Load Link / Store Conditional (LL/SC), while the x86 architecture uses Compare And Swap (CAS). LL/SC has stronger semantics than CAS in that any change to the value at the conditioned address will cause the store to fail, even if the other change replaces the value with the same value that the load was conditioned on. CAS, on the other hand, would succeed in this case. This is known as the ABA problem (see the CAS link for more info).

If you need the stronger semantics on the x86 architecture, you can approximate it by using the x86s Double Compare And Swap (DCAS) instruction cmpxchg8b, or cmpxchg16b under x86_64. This allows you to atomically swap two consecutive 'natural sized' words at once, instead of just the usual one. The basic idea is one of the two words contains the value of interest, and the other one contains an always incrementing 'mutation count'. Although this does not technically eliminate the problem, the likelihood of the mutation counter to wrap between attempts is so low that it's a reasonable substitute for most purposes.

johne
DCAS almost looks right, except i need to change 1 word only if a pointer to that word doesn't change while doing this (thats a little confusing, hopefully the update to the question helps clarify this).
Grant Peters
I managed to find a workaround using DCAS, its not foolproof, as it uses a unique ID (4 bytes in size) but the chances of it breaking are slim because both the 4 byte UID and the 4 byte counter adjacent to it must be replicated exactly. This is only a problem if something deletes the object reassigns the memory to something else and then manages to duplicate those 8 bytes while another thread is trying to copy a pointer, which is a relatively short operation (operation wise that is, length is only long enough if the thread is interrupted)
Grant Peters
A: 

What you are trying to do will not work the way you expect. What you implemented above can be done with the InterlockedIncrement function (Win32 function; assembly: XADD).

The reason that your code does not do what you think it does is that another thread can still change the value between the second read of *ptr and stwcx without invalidating the stwcx.

Ringding
the "if(pval != ptr) continue;"is safe because whenever another thread changes a smart pointer, it will also alter the counter that its pointing to, hence, it will invalidate the stwcx as that value does get changed, and that is what is being monitored for change (just requires some careful structuring)
Grant Peters
You really need to post the other side too, then. I just tried to build an answer but there was too much guessing involved. Usually, these kinds of problems can be solved using CAS.
Ringding
A: 

if you are on 64 bits and limit yourself to say 1tb of heap, you can pack the counter into the 24 unused top bits. if you have word aligned pointers the bottom 5 bits are also available.

int* IncrementAndRetrieve(int **ptr)
{
  int val;
  int *unpacked;
  do
  {   
    val = *ptr;
    unpacked = unpack(val);

    if(unpacked == NULL)
      return NULL;
    // pointer is on the bottom
  } while(!cas(unpacked, val, val + 1));
  return unpacked;
}
miaubiz
Memory doesn't have to be allocated at the lowest heap, so you can't be sure of this, unless you're specifying the addresses yourself (which i am), unfortunately, i'm not on a 64-bit platform, but this might be useful in the future.
Grant Peters
A: 

Don't know if LWARX and STWCX invalidate the whole cache line, CAS and DCAS do. Meaning that unless you are willing to throw away a lot of memory (64 bytes for each independent "lockable" pointer) you won't see much improvement if you are really pushing your software into stress. The best results I've seen so far were when people consciously casrificed 64b, planed their structures around it (packing stuff that won't be subject of contention), kept everything alligned on 64b boundaries, and used explicit read and write data barriers. Cache line invalidation can cost approx 20 to 100 cycles, making it a bigger real perf issue then just lock avoidance.

Also, you'd have to plan different memory allocation strategy to manage either controlled leaking (if you can partition code into logical "request processing" - one request "leaks" and then releases all it's memory bulk at the end) or datailed allocation management so that one structure under contention never receives memory realesed by elements of the same structure/collection (to prevent ABA). Some of that can be very counter-intuitive but it's either that or paying the price for GC.

ZXX
Yeah, this is kind of a non issue these days, in the end I've opted for more manual management and training the rest of the coders in the company how to properly do multi-threading via a couple of lock free structures that facilitate inter-thread communication.
Grant Peters