+4  Q:

## Compute the absolute difference between unsigned integers using SSE

In C, is there a branchless technique to compute the absolute difference between two unsigned ints? For example, given the variables a and b, I would like the value 2 for cases when a=3, b=5 or b=3, a=5. Ideally I would also like to be able to vectorize the computation using the SSE registers.

A:

a xor b?

I can't remember if C++ has an XOR operator now -- I think it's `a ^ b`.

that does not seem right
Rofl, indeed this is the bitwise difference. Some of us like to carry the 1's, though.
@Charlie Martin, The OP wants the arithmetic difference, not the bit-wise difference (i.e. the differences in bits).
Then that's what they should have asked, isn't it?
+1  A:
``````max(i,j) - min(i,j)
(i>j)*(i-j) + (j>i)*(j-i)
``````

you can certainly use SSE registers, but the compiler may do this for you anyway

+5  A:

There are several ways to do it, I'll just mention one:

SSE4

• Use `PMINUD` and `PMAXUD` to separate the larger value in register #1, and the smaller value in register #2.
• Subtract them.

MMX/SSE2

• Flip the sign bit of the two values because the next instruction only accepts signed integer comparison.
• `PCMPGTD`. Use this result as a mask.
• Compute the results of both `(a-b)` and `(b-a)`
• Use `POR ( PAND ( mask, a-b ), PANDN ( mask, b-a ) )` to select the correct value for the absolute difference.
+2  A:

Try this (assumes two's complement, which is OK judging by the fact that you're asking for SSE):

``````int d = a-b;
int ad = ((d >> 30) | 1) * d;
``````

Explanation: the arithmetic shift propagates the sign bit (bit 31) down to bit 1, so the shifted value is -1 or -2 for negative `d` and 0 or 1 otherwise; the `| 1` part then ensures that the multiplier is either 1 or -1. Multiplications are fast on modern CPUs.

+2  A:

compute the difference and return the absolute value

``````__m128i diff = _mm_sub_epi32(a, b);
__m128i mask = _mm_srai_epi32(diff, 31); // all-ones where diff is negative
diff = _mm_xor_si128(diff, mask);        // conditional one's complement
diff = _mm_sub_epi32(diff, mask);        // add one back where negative
``````

This requires one less operation than using the signed compare op, and produces less register pressure.

Same amount of register pressure as before, 2 more ops, but better branching and merging of dependency chains, instruction pairing for uop decoding, and separate execution-unit utilization. Although this requires a load, which may be out of cache. I'm out of ideas after this one.

``````__m128i mask, diff;
diff = _mm_set1_epi32(-1<<31); // dependency branch after
a = _mm_add_epi32(a, diff); // arithmetic sign flip
b = _mm_xor_si128(b, diff); // bitwise sign flip parallel with 'add' unit
diff = _mm_xor_si128(a, b); // reduce uops, instruction already decoded
mask = _mm_cmpgt_epi32(b, a); // parallel with xor
a = _mm_xor_si128(a, mask); // if 2 'bit' units in CPU, parallel with next
diff = _mm_sub_epi32(a, b); // result
``````

After timing each version with 2 million iterations on a Core2Duo, differences are immeasurable. So pick whatever is easier to understand.

Is `sum` supposed to be `diff`? Bah, now that I've read yours closely it's quite similar to mine. But more clever, nice on using the signed difference as a signed comparison. Comparison with zero is generally lighter-weight than right-shifting, though.
Actually, we both made a mistake: in the first function, a three-input consensus function is needed, not three-way XOR.
+1  A:

SSE2:

Seems to be about the same speed as Phernost's second function. Sometimes GCC schedules it to be a full cycle faster, other times a little slower.

``````__m128i big = _mm_set_epi32( INT_MIN, INT_MIN, INT_MIN, INT_MIN );

a = _mm_add_epi32( a, big ); // re-center the variables: send 0 to INT_MIN,
b = _mm_add_epi32( b, big ); // INT_MAX to -1, etc.
__m128i diff = _mm_sub_epi32( a, b ); // get signed difference
__m128i mask = _mm_cmpgt_epi32( b, a ); // mask: need to negate difference?
diff = _mm_xor_si128( diff, mask ); // 1's complement except MSB
diff = _mm_sub_epi32( diff, mask ); // add 1 and restore MSB
``````

SSSE3:

Ever so slightly faster than previous. There is a lot of variation depending on how things outside the loop are declared. (For example, making `a` and `b` `volatile` makes things faster! It appears to be a random effect on scheduling.) But this is consistently fastest by a cycle or so.

``````__m128i big = _mm_set_epi32( INT_MIN, INT_MIN, INT_MIN, INT_MIN );

a = _mm_add_epi32( a, big ); // re-center the variables: send 0 to INT_MIN,
b = _mm_add_epi32( b, big ); // INT_MAX to -1, etc.
__m128i diff = _mm_sub_epi32( b, a ); // get reverse signed difference
__m128i mask = _mm_cmpgt_epi32( b, a ); // mask: need to negate difference?
mask = _mm_xor_si128( mask, big ); // flip MSB: mask is now positive (keep) or negative (negate), never zero
diff = _mm_sign_epi32( diff, mask ); // negate diff if needed
``````

SSE4 (thx rwong):

Can't test this.

``````__m128i diff = _mm_sub_epi32( _mm_max_epu32( a, b ), _mm_min_epu32( a, b ) );
``````
A:

Erm ... it's pretty easy ...

``````int diff = abs( a - b );
``````

Easily vectorisable (using SSSE3, since `_mm_abs_epi32` is an SSSE3 intrinsic):

``````__m128i sseDiff = _mm_abs_epi32( _mm_sub_epi32( a, b ) );
``````

Note that taking the absolute value of the wrapped difference is only exact while the true unsigned difference fits in 31 bits.