tags:

views:

518

answers:

5

Herlihy and Shavit's book (The Art of Multiprocessor Programming) solution to memory reclamation uses Java's AtomicStampedReference<T>;.

To write one in C++ for the x86_64 I imagine requires at least a 12 byte swap operation - 8 for a 64bit pointer and 4 for the int.

Is there x86 hardware support for this and if not, any pointers on how to do wait-free memory reclamation without it?

+2  A: 

Windows gives you a bunch of Interlocked functions that are atomic and can probably be used to do what you want. Similar functions exist for other platforms, and I believe Boost has an interlocked library as well.

Your question isn't super clear and I don't have a copy of Herlihy and Shavit laying around. Perhaps if you elaborated or gave psuedo code outlining what you want to do, we can give you a more specific answer.

jeffamaphone
Yeah, I wanted to compare a 64bit ptr + 32bit int, with another, and swap, both together, atomically etc etc. Windows has at max InterlockedCompare64Exchange128, I was after a hypothetical InterlockedCompare128Exchange128 (or 96 really but...). The problem is that such a function has have to have a signature, and what is there that's 128 bytes? So neither Windows nor GCC have one. But luckily as I explain above, 64 bits is enough if you're willing to 'risk' a counter of just 65536 (16bits); just fold it into and out of the pointer appropriately, x86_64 seems to use only 48 bits of address spc.
JDonner
You want __InterlockedCompareExchange128;http://msdn.microsoft.com/en-us/library/bb514094.aspxNote the double leading underscores.
Blank Xavier
+1  A: 

Ok hopefully, I have the book,

For others that may provides answers, the point is to implement this class :

class AtomicReference<T>{
  public:
    void set(T *ref, int stamp){ ... }
    T *get(int *stamp){ ... }
  private:  
    T *_ref; 
    int _stamp;

};

in a lock-free way so that :

  • set() updates the reference and the stamp, atomicly.
  • get() returns the reference and set *stamp to the stamp corresponding to the reference.

JDonner please, correct me if I am wrong.

Now my answer : I don't think you can do it without a lock somewhere (a lock can be while(test_and_set() != ..)). Therefore there is no lockfree algorithm for this. This would mean that it is possible to build an N-bythe register a lock-free way for any N.

If you look at the book pragma 9.8.1, The AtomicMarkableReference wich is the same with a single bit insteam of an integer stamp. The author suggest to "steal" a bit from a pointer to extract the mark and the pointer from a single word (alsmost quoted) This obviously mean that they want to use a single atomic register to do it.

However, there may be a way to bluid a wait-free memory reclamation without it. I don't know.

Ben
"Therefore there is no lockfree algorithm for this. " Well, an AtomicReference is used as part of algorithms, so if several threads are using it, as long as some thread makes progress (see page 99 of the same book) it's still lock-free. Not wait-free though as I originally said.
JDonner
+3  A: 

Yes, there is hardware support, though I don't know if it is exposed by C++ libraries. Anyway, if you don't mind doing some low-level unportable assembly language trickery - look up the CMPXCHG16B instruction in Intel manuals.

stormsoul
Compare and swap or any other Read-Modify-Write instruction are the primitive for any synchronization on modern processor. Using them instead of higher level synchronization object (mutexes or semaphores) does not mean your algorithm is lock-free.However, if you use gcc or icc I suggest you have a look on this instead of using asm instructions http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html
Ben
GCC has no built-in atomic for double-word CAS. You have to use __asm__.
Blank Xavier
+2  A: 

Yes, x64 supports this; you need to use CMPXCHG16B.

Blank Xavier
+1  A: 

You can save a bit on memory by relying on the fact that the pointer will use less than 64 bits. First, define a compare&set function (this ASM works in GCC & ICC):

inline bool CAS_ (volatile uint64_t* mem, uint64_t old_val, uint64_t new_val)
{
  unsigned long old_high = old_val >> 32, old_low = old_val;
  unsigned long new_high = new_val >> 32, new_low = new_val;

  char res = 0;
  asm volatile("lock; cmpxchg8b (%6);"
               "setz %7; "
               : "=a" (old_low),  // 0
                 "=d" (old_high)  // 1
               : "0" (old_low),   // 2
                 "1" (old_high),  // 3
                 "b" (new_low),   // 4
                 "c" (new_high),  // 5
                 "r" (mem),       // 6
                 "m" (res)        // 7
               : "cc", "memory");
  return res;
}

You'll then need to build a tagged-pointer type. I'm assuming a 40-bit pointer with a cacheline-width of 128-bytes (like Nehalem). Aligning to the cache-line will give enormous speed improvements by reducing false-sharing, contention, etc.; this has the obvious trade-off of using a lot more memory, in some situations.

template <typename pointer_type, typename tag_type, int PtrAlign=7, int PtrWidth=40>
struct tagged_pointer
{
  static const unsigned int PtrMask = (1 << (PtrWidth - PtrAlign)) - 1;
  static const unsigned int TagMask = ~ PtrMask;

  typedef unsigned long raw_value_type;

  raw_value_type raw_m_;

  tagged_pointer () : raw_m_(0) {}
  tagged_pointer (pointer_type ptr) { this->pack(ptr, 0); }
  tagged_pointer (pointer_type ptr, tag_type tag) { this->pack(ptr, tag); }

  void pack (pointer_type ptr, tag_type tag)
  {
    this->raw_m_ = 0;
    this->raw_m_ |= ((ptr >> PtrAlign) & PtrMask);
    this->raw_m_ |= ((tag << (PtrWidth - PtrAlign)) & TagMask);
  }

  pointer_type ptr () const
  {
    raw_value_type p = (this->raw_m_ & PtrMask) << PtrAlign;
    return *reinterpret_cast<pointer_type*>(&p);
  }

  tag_type tag () const
  {
    raw_value_type t = (this->raw_m_ & TagMask) >> (PtrWidth - PtrAlign_;
    return *reinterpret_cast<tag_type*>(&t);
  }
};

I haven't had a chance to debug this code, so you'll need to do that, but this is the general idea.

thechao