views: 732, answers: 7

I want to use a "_test_and_set lock" assembly-language implementation, built around an atomic swap instruction, in my C/C++ program.

class LockImpl 
{
  public:
  static void lockResource(DWORD resourceLock )
  {
    __asm 
    {
      InUseLoop:  mov     eax, 0;0=In Use
                  xchg    eax, resourceLock
                  cmp     eax, 0
                  je      InUseLoop
    }

  }

  static void unLockResource(DWORD resourceLock )
  {
    __asm 
    {
      mov resourceLock , 1 
    }   

  }
};

This works, but there is a bug in it.

The problem is that I want to pass DWORD *resourceLock instead of DWORD resourceLock.

So the question is: how do I pass a pointer from C/C++ into assembly and get it back?

Thanks in advance.

Regards, -Jay.

P.S. This is done to avoid context switches between user space and kernel space.

A: 

Look at your compiler documentation to find out how to print the generated assembly language for functions.

Print the assembly language for this function:

static void unLockResource(DWORD resourceLock )
{
  resourceLock = 0;
  return;
}

This may not work as-is, because the compiler can optimize the function and remove all of the code. You should change the above function to take a pointer to resourceLock and have it set the lock through that pointer, then print the assembly of the working function.
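
For illustration, the pointer-taking variant might look like this (a sketch only; it writes 0 like the snippet above, while the question's own code treats 1 as "free", and the volatile is there so the store is not optimised away):

// Sketch: pass a pointer so the function writes to the caller's lock variable.
// Compile this and print its assembly to see the indirect store.
static void unLockResource(volatile DWORD *resourceLock)
{
  *resourceLock = 0;   // assignment through the pointer
}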

Thomas Matthews
+3  A: 

Apparently, you are compiling with MSVC using inline assembly blocks in your C++ code.

As a general remark, you should really use compiler intrinsics, since inline assembly has no future: it is no longer supported by the MS compilers when compiling for x64.

If you need functions fine-tuned in assembly, you will have to implement them in separate assembly files.
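
For example, the question's spin lock could be written with the _InterlockedExchange intrinsic instead of inline assembly (a sketch only, keeping the question's 0 = in use / 1 = free convention; this builds for both x86 and x64 with MSVC):

#include <intrin.h>   // MSVC intrinsics

// 0 = in use, 1 = free (same convention as the question)
static void lockResource(volatile long *resourceLock)
{
    // Atomically write 0 (in use); if the previous value was 0,
    // someone else already holds the lock, so keep spinning.
    while (_InterlockedExchange(resourceLock, 0) == 0)
        ;
}

static void unLockResource(volatile long *resourceLock)
{
    _InterlockedExchange(resourceLock, 1);   // 1 = free again
}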

Gregory Pakosz
No longer supported by whom? Your link is to MS, but the entire planet isn't using the MS compiler.
Ken White
Inline assembly (that is, assembly blocks in C/C++ code) is no longer supported by MS. And as far as I can see, it really looks like his code is inline assembly compiled with MSVC.
Gregory Pakosz
@Ken do you prefer it rephrased like this?
Gregory Pakosz
and he confirmed in other comments he's using MSVC anyway...
Gregory Pakosz
@Ken, @Gregory: Yes, that's correct! I hate Windows, but my management thinks it speeds up development. I have seen a lot of scheduling-related issues where the Windows scheduler doesn't behave predictably, which makes some threads run wild while others sit around. :( I tried enforcing CPU affinity, but that hurts performance badly.
Jay D
@Gregory: Rephrased that way, it's much better. :-) Sorry, but what the OP confirmed in other comments didn't change anything with your old phrasing; it said that basically everyone was dropping inline assembly for x64, which was incorrect. Your new text is specifically referring to MS, which is more correct.
Ken White
@Ken: well, inline assembly blocks with the given syntax could only be MS; that's why I originally answered like this :)
Gregory Pakosz
+6  A: 

If you're writing this for Windows, you should seriously consider using a critical section object. The critical section API functions are optimised such that they won't transition into kernel mode unless they really need to, so the normal case of no contention has very little overhead.

The biggest problem with your spin lock is that if you're on a single CPU system and you're waiting for the lock, then you're using all the cycles you can and whatever is holding the lock won't even get a chance to run until your timeslice is up and the kernel preempts your thread.

Using a critical section will serve you better than trying to roll your own user-mode spin lock.
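
A minimal sketch of what that looks like (function and variable names are illustrative):

#include <windows.h>

CRITICAL_SECTION g_lock;

void init()     { InitializeCriticalSection(&g_lock); }
void shutdown() { DeleteCriticalSection(&g_lock); }

void useResource()
{
    EnterCriticalSection(&g_lock);   // no kernel transition unless the lock is contended
    // ... touch the shared resource ...
    LeaveCriticalSection(&g_lock);
}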

Greg Hewgill
I do NOT want to write Windows-specific code. That's the reason I didn't use Interlocked and the other stuff; I want to keep it as generic as possible. I am desperately waiting for the move to Linux, but that's not going to happen for at least the next couple of months.
Jay D
Writing assembly isn't going to help portability. If you want portable critical sections, code to a neutral API. Consider boost synchronization: http://www.boost.org/doc/libs/1_41_0/doc/html/thread/synchronization.html
plinth
I'm voting this down for the simple reason that it does not answer the question asked. The question is a valid one, and an answer to it would be good to see.
xcut
I'm voting it up because it's a BAD idea to spinlock on a system that can only run one thread at a time. Downvoting this is like saying "I don't care if it's a bad idea to shoot with kids downrange, I'm only going to hit the target I asked you to put up!"... Lastly, I don't understand why this logic can't be abstracted away so that whatever you change only has to be done once, in one little place. Obviously that isn't the current plan, since otherwise the OP wouldn't be using a DWORD and writing inline assembly.
San Jacinto
It's OK to write a spinlock if you know your platform has multiple hardware threads. The default Windows CS is basically a combined mutex + spinlock with a spin count: it tries to spin for the spin count before falling back to the OS mutex. On a single-processor system the spin count used is always zero, which makes the default behavior there just a mutex.
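
For reference, the spin count can also be set explicitly (a sketch; the count of 4000 is just an illustrative value):

CRITICAL_SECTION cs;
// Spin up to 4000 times before falling back to the kernel wait object;
// on a single-processor system the spin count is ignored.
InitializeCriticalSectionAndSpinCount(&cs, 4000);
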
Adisak
+1  A: 

You should be using something like this:

volatile LONG resourceLock = 1;   // 1 = free, 0 = in use

if(InterlockedCompareExchange(&resourceLock, 0, 1) == 1) {
    // success! the previous value was 1, so we now hold the lock
    // do something, and then
    resourceLock = 1;             // release: mark it free again
} else {
    // failed, someone else holds it; try again later
}

See InterlockedCompareExchange.

Fozi
His code should work perfectly well with just InterlockedExchange(). For an unchecked value swap, InterlockedExchange is a bit faster than InterlockedCompareExchange.
Adisak
Yes but you usually use CAS in this kind of situation and I wanted to bump him in the right direction.
Fozi
Oh, and as far as I know both the InterlockedCompareExchange and InterlockedExchange intrinsics boil down to a single assembly instruction, so I wouldn't say that one is faster than the other. Well, OK, CAS might need a cycle more.
Fozi
Actually, CAS and XCHG are considerably different. For one, XCHG ALWAYS succeeds, while CAS may fail. CAS usually requires additional code to pre-read the value before setting it, and if it fails you must retry or run other code. And despite the fact that both are single ASM instructions that are fairly slow due to the bus lock, XCHG can execute more quickly than CAS, simply because you *ALWAYS* have to check whether CAS succeeded. If your algorithm is simple enough to use XCHG, then XCHG is the preferred operation.
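
To illustrate the difference for a simple lock (a sketch; 0 = busy, 1 = free, matching the convention used elsewhere in this thread):

#include <windows.h>

volatile LONG lock = 1;   // 0 = busy, 1 = free

void acquire_with_xchg()
{
    // Unconditionally write 0 and inspect the previous value.
    while (InterlockedExchange(&lock, 0) == 0)
        ;  // it was already 0 (held), keep spinning
}

void acquire_with_cas()
{
    // Write 0 only if the current value is still 1, then check whether that happened.
    while (InterlockedCompareExchange(&lock, 0, 1) != 1)
        ;  // the compare failed (held), keep spinning
}
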
Adisak
+1  A: 

The main problems with the original version in the question are that it needs to use register-indirect addressing, and that it should take a reference (or pointer) parameter rather than a by-value parameter for the lock DWORD.

Here's a working solution for Visual C++. EDIT: I have worked offline with the author, and we have verified that the code in this answer works correctly in his test harness.

But if you're using Windows, you should really be using the Interlocked API (e.g. InterlockedExchange).

Edit: As noted by caf, a lock prefix on xchg is not required, because xchg asserts a bus lock automatically.

I also added a faster version that does a non-locking read before attempting the xchg. This significantly reduces bus-lock contention on the memory interface. The algorithm can be sped up quite a bit more (in the contended multithreaded case) by backing off (yield, then sleep) on locks that are held for a long time; a sketch of that follows the code below. For the single-CPU case, an OS lock that sleeps immediately on a held lock will be fastest.

class LockImpl
{
    // This is a simple SpinLock
    //  0 - in use / busy
    //  1 - free / available
public:
    static void lockResource(volatile DWORD &resourceLock )
    {
        __asm 
        {
            mov     ebx, resourceLock
InUseLoop:
            mov     eax, 0           ;0=In Use
            xchg    eax, [ebx]
            cmp     eax, 0
            je      InUseLoop
        }

    }

    static void lockResource_FasterVersion(DWORD &resourceLock )
    {
        __asm 
        {
            mov     ebx, resourceLock
InUseLoop:
            mov     eax, [ebx]    ;// Read without BusLock 
            cmp     eax, 0
            je      InUseLoop     ;// Retry Read if Busy

            mov     eax, 0
            xchg    eax, [ebx]    ;// XCHG with BusLock
            cmp     eax, 0
            je      InUseLoop     ;// Retry if Busy
        }
    }

    static void unLockResource(volatile DWORD &resourceLock)
    {
        __asm 
        {
            mov     ebx, resourceLock
            mov     [ebx], 1 
        }       

    }
};

// A little testing code here
volatile DWORD aaa=1;
void test()
{
 LockImpl::lockResource(aaa);
 LockImpl::unLockResource(aaa);
}
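
The backoff mentioned above might look roughly like this (a sketch only, not part of the verified code above; SwitchToThread and Sleep are the usual Win32 yield/sleep calls, and the thresholds are arbitrary):

static void lockResourceWithBackoff(volatile DWORD &resourceLock)
{
    // Sketch: spin a while, then yield, then sleep (assumes <windows.h>).
    unsigned spins = 0;
    for (;;)
    {
        // cheap unlocked read first so we don't hammer the bus while busy
        if (resourceLock != 0 &&
            InterlockedExchange((volatile LONG *)&resourceLock, 0) != 0)
            return;                  // swapped out a non-zero (free) value: lock acquired

        ++spins;
        if (spins > 200)
            Sleep(1);                // held a long time: give up the timeslice
        else if (spins > 100)
            SwitchToThread();        // yield to any other ready thread
        // otherwise: plain busy spin
    }
}
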
Adisak
`xchg` is supposed to imply a bus lock. However if you don't put a `lock` prefix on the `mov` in `unLockResource`, I think you will need an `mfence` instruction before it, to ensure all sideeffects from the critical section you just left are seen by the other processors before they're allowed to enter their critical sections.
caf
I tried your code and it does NOT work. What happens is that only the first thread picks up the lock, and it's not able to release it. Secondly, I was trying out the mfence/sfence + lock combination, but that too does not seem to work.
Jay D
Have you verified that your program using this lock works with any other type of lock? I have run the above code on a quad-core machine in a highly contentious test bed that does multithreaded modifications and independent verification under locks, using 8 threads. This test bed is strong enough that it found a bug in the common Ruediger Asche reader-writer lock from the MSDN website. This code passed the correctness tests with flying colors, although it was several orders of magnitude slower than the locking mechanism I actually use. Is it possible your calling code is not correct?
Adisak
lockTestThreadStatus_ = 1;
while (lockTestThreadStatus_ == 1)
{
    LockImpl::lockResource(commonResource->lockCnt);
    //=========== Critical Section ===========
    commonResource->resource++;
    printf("The Thread ID : %ld resource value :%ld \n",
           GetCurrentThreadId(), commonResource->resource);
    //=========== Critical Section End =======
    LockImpl::unLockResource(commonResource->lockCnt);
}
Jay D
This is my calling code; it is basically the thread function. I expect to see multiple thread IDs incrementing the resource counter one after the other, but with the above code I only see one thread ID continuously doing the increment. Ironically, the code I posted shows the correct behavior, though I know about the bug in it. :) I am running this test on a dual-core machine without hyperthreading.
Jay D
@JayD: I just ran your sample code in my test bed and the resource counter is being incremented from multiple threads.
Adisak
@caf: No mfence is required. The xchg has an implicit lock/mfence. The x86 already enforces order for visibility of writes so any writes made while the lock is held will be visible to other cores before the write that releases the lock. Also, _ReadWriteBarrier is not required either since it's implied by the asm statement.
Adisak
FWIW: My lock testbed is a modified version of this: http://www.codeproject.com/KB/threads/testing_rwlocks.aspx I gave my modifications to the author but she has not yet posted them.
Adisak
@Adisak alright I hereby accept your ans even though somehow it is not working on the pc that i was running my test on ..! I'll try it out on other pcs
Jay D
@JayD: Send me an e-mail to gmail and I will e-mail you back the working code in the test harness. My username for e-mail is the same as it is here. I'm also trying to have Valery post the updated version of her test harness with the modifications I made to it.
Adisak
@JayD: I just tried my code in your test and it works perfectly. Locked work is done by both threads correctly. However, because printf is so slow, the lock is held much longer by the updating thread, which is therefore more likely to reacquire the lock than the waiting thread. This means you see "bunching", where there are many accesses by thread 1, then many accesses by thread 2. This is not as "fair" as a first-come-first-served CS, but spinlocks are never fair. You get much fairer behavior if you cache off the resource value and move the printf out of the locked section.
Adisak
@JayD: Note, however, that moving printf() out of the locked section might not work if your C library's printf() is not thread-safe. I moved it out of the CS and got nearly a 50% distribution of work between the two threads. With printf() in the CS, as much as 99% of the work was done in one thread and 1% in the other.
Adisak
@Adisak: Thanks for your extended help. Yes, I agree with your analysis, but note that the printf, along with its GetThreadID() kernel call, is there just for debugging purposes. Thanks for all the offline help.
Jay D
Second, I ran the quick test I mentioned on a desktop in a non-production environment: 32-bit Windows XP with a bunch of (unwanted) Windows services such as ClearCase (SQL), a browser, antivirus, and so on. Also, the threads in the test were running at normal priority, without enforcing any CPU affinity, on a non-real-time kernel (Windows, duh!).
Jay D
@JayD: I'm just glad we were able to get this to work for you as I wrote it.
Adisak
Me too; I got negative feedback from someone for some reason, and one thumbs-down (?). I don't know what's going on!
Jay D
+3  A: 

In terms of your actual question, it's pretty simple: just change the function headers to use volatile DWORD *resourceLock, and change the assembly lines that touch resourceLock to use indirection:

mov ecx, dword ptr [resourceLock]
xchg eax, dword ptr [ecx]

and

mov ecx, dword ptr [resourceLock]
lock mov dword ptr [ecx], 1

However, note that you've got a couple of other problems looming:

  • You say you're developing this on Windows but want to switch to Linux. However, you're using MSVC-specific inline assembly, which will have to be ported to gcc-style inline assembly when you move to Linux (in particular, that involves switching from Intel syntax to AT&T syntax; a rough sketch follows this list). You will be much better off developing with gcc even on Windows; that will minimise the pain of migration (see MinGW for gcc on Windows).

  • Greg Hewgill is absolutely right about spinning uselessly, stopping the lock-holder from getting CPU. Consider yielding the CPU if you've been spinning for too long.

  • On a multiprocessor x86, you might well have a problem with memory loads and stores being re-ordered around your lock - mfence instructions in the lock and unlock procedures might be necessary.
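
For reference, a rough gcc-style equivalent of the lock loop might look like this (a sketch, not tested everywhere; on a recent gcc the __sync or __atomic builtins would be the cleaner choice):

static void lockResource(volatile unsigned long *resourceLock)
{
    unsigned long old;
    do {
        old = 0;                                   /* 0 = in use */
        __asm__ __volatile__("xchg %0, %1"         /* atomic swap with memory */
                             : "+r"(old), "+m"(*resourceLock)
                             :
                             : "memory");
    } while (old == 0);                            /* it was already in use: retry */
}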


Really, if you're worrying about locking that means you're using threading, which probably means you're using the platform-specific threading APIs already. So use the native synchronisation primitives, and switch out to the pthreads versions when you switch to Linux.
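
On the Linux side, the pthreads equivalent is essentially this (a sketch):

#include <pthread.h>

static pthread_mutex_t resourceMutex = PTHREAD_MUTEX_INITIALIZER;

void touchResource(void)
{
    pthread_mutex_lock(&resourceMutex);
    /* ... critical section ... */
    pthread_mutex_unlock(&resourceMutex);
}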

caf
This doesn't work for VC++ inline asm - at least not on my test in a debug build. You actually need a `mov ebx, resourceLock` followed by a `xchg eax, [ebx]` to do what he wants to do. Just `xchg eax, [resourceLock]` will actually swap eax with the value stored in the pointer (or address of the reference).
Adisak
Explanation: let's say our pointer has a value (and its dereferenced memory has a value too): say `DWORD *resourceLock == 0x493004` and `*(DWORD*)(0x00493004) == 1`. Then after `mov ebx, resourceLock; xchg eax, [ebx]`, eax will equal 1. After `xchg eax, [resourceLock]`, eax will equal 0x493004, which is most definitely not what you want.
Adisak
Thanks, fixed it. It's been a long while since I used Intel syntax assembly, it's a bit rusty!
caf
@caf: FWIW, the `lock mov` you have in the unlock code can just be a `mov`. x86 enforces visible write ordering, so the second lock is unnecessary and, in fact, only slows things down. The only reordering x86 does is moving loads above older stores to a different address; there is no store reordering, and stores are never moved above loads. Since this code uses an `xchg` (which has an implicit lock/mfence), there is no need to add an mfence to the code.
Adisak
@CAF: In the unlock code, since the mov doesn't have to be bus-locked it can be done in C with a simple assignment rather than ASM. However, taking it out of ASM means no compiler read-write barrier so you have to add one. Basically this works with the pointer: `_ReadWriteBarrier(); *resourceLock=1;` (with a reference, you eliminate the * before resourceLock).
Adisak
Adisak: The reordering I'm worried about is not of accesses to the lock variable itself; it's the possibility of loads of the shared data protected by the lock being reordered out of the critical section.
caf
@CAF: Yes I realize that. However, the X86 enforces strict write ordering visibility even across different addresses. Therefore, if you modify any data while the lock is held, it's impossible for another CPU to see those modifications LATER than it sees the write to the LOCK variable during the UNLOCK. Therefore, this code is safe without a mfence. You do need a memory-fence on other CPU's like PowerPC but you DO NOT need one on X86 for the unlock. If you want an example look at code for the SpinLock in the X86 Intel Threaded Building Blocks (TBB).
Adisak
A: 

I already provided a working version that answered the original poster's question, both on how to pass the parameters into the ASM and on how to get his lock working correctly.

Many other answers have questioned the wisdom of using ASM at all and suggested that either intrinsics or OS calls from C should be used. The following works as well and is a C++ version of my ASM answer. There is a snippet of ASM in there that only needs to be used if your platform does not support InterlockedExchange().

class LockImpl
{
 // This is a simple SpinLock
 // 0 - in use / busy
 // 1 - free / available
public:
#if 1
 static DWORD MyInterlockedExchange(volatile DWORD *variable,DWORD newval)
 {
  // InterlockedExchange() uses LONG / He wants to use DWORD
  return((DWORD)InterlockedExchange(
      (volatile LONG *)variable,(LONG)newval));
 }
#else
 // You can use this if you don't have InterlockedExchange()
 // on your platform. Otherwise no ASM is required.
 static DWORD MyInterlockedExchange(volatile DWORD *variable,DWORD newval)
 {
  DWORD old;
  __asm 
  {
   mov  ebx, variable
   mov  eax, newval
   xchg eax, [ebx] ;// XCHG with BusLock
   mov  old, eax
  }
  return(old);
 }
#endif
 static void lockResource(volatile DWORD &resourceLock )
 {
  DWORD oldval;
  do 
  {
   while(0==resourceLock)
   {
    // Could have a yield, spin count, exponential 
    // backoff, OS CS fallback, etc. here
   }
   oldval=MyInterlockedExchange(&resourceLock,0);
  } while (0==oldval);
 }
 static void unLockResource(volatile DWORD &resourceLock)
 {
  // _ReadWriteBarrier() is a VC++ intrinsic that generates
  // no instructions / only prevents compiler reordering.
  // GCC uses __sync_synchronize() or __asm__ ( :::"memory" )
  _ReadWriteBarrier();
  resourceLock=1;
 }
};
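
Usage mirrors the test code in my earlier answer (illustrative names):

volatile DWORD sharedLock = 1;   // 1 = free

void worker()
{
    LockImpl::lockResource(sharedLock);
    // ... critical section ...
    LockImpl::unLockResource(sharedLock);
}
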
Adisak
The question was not about whether to use inline assembly; it was about passing a variable to assembly from C by reference instead of by value. Thanks for your extended code, though!
Jay D
Wow... considering the top answer just says "Use Critical Sections" and doesn't answer the question at all, getting down-voted here seems a bit odd. Especially when many of the other answers are gripes about not using ASM and this is not just a follow-up but an ACTUAL working example that addresses those responses. I actually put some thought and work into this, contacting the author offline to help get things functional in his own code testbed, and then provided an additional non-ASM working version as well. I guess it doesn't pay to go out of your way to try to be helpful.
Adisak