views:

424

answers:

4

I want to save memory by converting an existing 32 bit counter to a 16 bit counter. This counter is atomically incremented/decremented. If I do this:

  1. What instructions do I use for atomic_inc(uint16_t x) on x86/x86_64?
  2. Is this reliable in multi-processor x86/x86_64 machines?
  3. Is there a performance penalty to pay on any of these architectures for doing this?
  4. If yes for (3), what's the expected performance penalty?

Thanks for your comments!

+3  A: 

Here is a Delphi function that works:

function LockedInc( var Target :WORD ) :WORD;
asm
        mov     ecx, eax
        mov     ax, 1
   Lock xadd    [ecx], ax
        Inc     eax
end;

I guess you could convert it to whichever language you require.

Steve
To clarify for non Delphi/BASM users, I would add that in this (32 bit) routine, a pointer to Target will be passed in in EAX, and that the return value of the function will be in AX.
PhiS
+3  A: 

Here's one that uses GCC assembly extensions, as an alternative to Steve's Delphi answer:

uint16_t atomic_inc(uint16_t volatile* ptr)
{
    uint16_t value(1);
    __asm__("lock xadd %w0, %w1" : "+r" (value) : "m" (*ptr));
    return ++value;
}

Change the 1 with -1, and the ++ with --, for decrement.

Chris Jester-Young
Thanks, but I am really talking about AMD64/Pentiums here. :)
Sudhanshu
That's cool. I'm still giving you a C alternative in case you don't want to code with Delphi. :-)
Chris Jester-Young
(Well, change the initializers from `uint_t value(1)` to `uint_t value = 1` for C (my C++ habits are getting to me), but yeah. :-P)
Chris Jester-Young
A: 

To answer the other three questions:

  1. Didn't find a way to make a numbered list starting with 2
  2. Yes, this is reliable in a multiprocessor environment
  3. Yes, there is a performance penalty
  4. The "lock" prefix locks down the busses, not only for the processor, but for any external hardware, which may want to access the bus via DMA (mass storage, graphics...). So it is slow, typically ~100 clock cycles, but it may be more costly. But if you have "megabytes" of counters, chances are, you will be facing a cache miss, and in this case you will have to wait about ~100 clocks anyway (the memory access time), in case of a page miss, several hundred, so the overhead from lock might not matter.
drhirsch
Thanks for your answer. It comes closest to what I was looking for after Chris's reply.
Sudhanshu
I believe the bus locking is a thing of days of yore. Current generation processors do cache line locking instead: that also ties in neatly with the MESI or MESI like protocol used for cache coherence.
terminus
A: 

The simplest way to perform an atomic increase is as follows (this is inline ASM):

asm
  lock inc dword ptr Counter;
end;

where J is an integer. This will directly increase Counter in its memory location.

I have tested this with brute force and it works 100%.

IanC