views:

168

answers:

6

I have the following bottleneck function.

typedef unsigned char byte;
void CompareArrays(const byte * p1Start, const byte * p1End, const byte * p2, byte * p3)
{
     const byte b1 = 128-30;
     const byte b2 = 128+30;
     for (const byte * p1 = p1Start; p1 != p1End; ++p1, ++p2, ++p3) {
        *p3 = (*p1 < *p2 ) ? b1 : b2;
    }
}

I want to replace C++ code with SSE2 intinsic functions. I have tried _mm_cmpgt_epi8 but it used signed compare. I need unsigned compare.

Is there any trick (SSE, SSE2, SSSE3) to solve my problem?

Note: I do not want to use multi-threading in this case.

+1  A: 

Yes, SSE will not work here. You can improve this code performance on multi-core computer by using OpenMP:

void CompareArrays(const byte * p1Start, const byte * p1End, const byte * p2, byte * p3)
{
     const byte b1 = 128-30;
     const byte b2 = 128+30;

     int n = p1End - p1Start;
     #pragma omp parallel for
     for (int i = 0; i < n; ++p1, ++i) 
     {
        p3[i] = (p1[i] < p2[i]) ? b1 : b2;
     }
}
Alex Farber
This will not work for single core or single CPU
VJo
@ VJo - yes, of course. On single core computer this code performs exactly like original code from the question.
Alex Farber
@VJo it will work but not give any boost
Andrey
A: 

use pcmpeqb and be the Power with you.

Alexander Solonsky
`pcmpeqb` is a check for equal. I need less compare.
Alexey Malistov
ah yes. then pcmpgtb. still use the Power. but wisely.
Alexander Solonsky
OP needs unsigned comparison.
Michael Foukarakis
+2  A: 

You can subtract 127 from your numbers, and then use _mm_cmpgt_epi8

VJo
It seems like right answer. But I think that your 127 must be replaced with 128. Or xor with 128.
Alexey Malistov
The problem is I think there's only a packed add in MMX, which is a different register set entirely.
Crashworks
yes, you are right. 128, not 127
VJo
+2  A: 

Yes, this can be done in SIMD, but it will take a few steps to make the mask.

Ruslik got it right, I think. You want to xor each component with 0x80 to flip the sense of the signed and unsigned comparison. _mm_xor_si128 (PXOR) gets you that -- you'll need to create the mask as a static char array somewhere before loading it into a SIMD register. Then _mm_cmpgt_epi8 gets you a mask and you can use a bitwise AND (eg _mm_and_si128) to perform a masked-move.

Crashworks
A: 

Ok, my previous answer was stupid, i didn't dig into sse intrinsics. You can emulate unsigned comparison by two signed.

Assume a, b, a > max signed (thus it is negative). a > b is true if (a > b) && ((a >> 1) > (b >> 1)), assuming comparison is signed. There are SSE shift operations.

Andrey
+4  A: 

Instead of offsetting your signed values to make them unsigned, a slightly more efficient way would be to do the following:

  • use _mm_min_epu8 to get the unsigned min of p1, p2
  • compare this min for equality with p2 using _mm_cmpeq_epi8
  • the resulting mask will now be 0x00 for elements where p1 < p2 and 0xff for elements where p1 >= p2
  • you can now use this mask with _mm_or_si128 and _mm_andc_si128 to select the appropriate b1/b2 values

Note that this is 4 instructions in total, compared with 5 using the offset + signed comparison approach.

Paul R