ansaurus

Question

Answer 1

+1 A:

Yes, SSE will not work here. You can improve this code performance on multi-core computer by using OpenMP:

void CompareArrays(const byte * p1Start, const byte * p1End, const byte * p2, byte * p3)
{
     const byte b1 = 128-30;
     const byte b2 = 128+30;

     int n = p1End - p1Start;
     #pragma omp parallel for
     for (int i = 0; i < n; ++p1, ++i) 
     {
        p3[i] = (p1[i] < p2[i]) ? b1 : b2;
     }
}

Alex Farber 2010-10-21 12:01:35

This will not work for single core or single CPU

VJo 2010-10-21 12:05:32

@ VJo - yes, of course. On single core computer this code performs exactly like original code from the question.

Alex Farber 2010-10-21 12:09:16

@VJo it will work but not give any boost

Andrey 2010-10-21 12:56:17

Answer 2

A:

use pcmpeqb and be the Power with you.

Alexander Solonsky 2010-10-21 12:06:54

`pcmpeqb` is a check for equal. I need less compare.

Alexey Malistov 2010-10-21 12:11:21

ah yes. then pcmpgtb. still use the Power. but wisely.

Alexander Solonsky 2010-10-21 12:16:52

OP needs unsigned comparison.

Michael Foukarakis 2010-10-21 12:47:21

Answer 3

+2 A:

You can subtract 127 from your numbers, and then use _mm_cmpgt_epi8

VJo 2010-10-21 12:13:39

It seems like right answer. But I think that your 127 must be replaced with 128. Or xor with 128.

Alexey Malistov 2010-10-21 12:22:22

The problem is I think there's only a packed add in MMX, which is a different register set entirely.

Crashworks 2010-10-21 12:23:17

yes, you are right. 128, not 127

VJo 2010-10-21 12:42:50

Answer 4

+2 A:

Yes, this can be done in SIMD, but it will take a few steps to make the mask.

Ruslik got it right, I think. You want to xor each component with 0x80 to flip the sense of the signed and unsigned comparison. _mm_xor_si128 (PXOR) gets you that -- you'll need to create the mask as a static char array somewhere before loading it into a SIMD register. Then _mm_cmpgt_epi8 gets you a mask and you can use a bitwise AND (eg _mm_and_si128) to perform a masked-move.

Crashworks 2010-10-21 12:19:07

Answer 5

A:

Ok, my previous answer was stupid, i didn't dig into sse intrinsics. You can emulate unsigned comparison by two signed.

Assume a, b, a > max signed (thus it is negative). a > b is true if (a > b) && ((a >> 1) > (b >> 1)), assuming comparison is signed. There are SSE shift operations.

Andrey 2010-10-21 12:45:14

Answer 6

+4 A:

Instead of offsetting your signed values to make them unsigned, a slightly more efficient way would be to do the following:

use _mm_min_epu8 to get the unsigned min of p1, p2
compare this min for equality with p2 using _mm_cmpeq_epi8
the resulting mask will now be 0x00 for elements where p1 < p2 and 0xff for elements where p1 >= p2
you can now use this mask with _mm_or_si128 and _mm_andc_si128 to select the appropriate b1/b2 values

Note that this is 4 instructions in total, compared with 5 using the offset + signed comparison approach.

Paul R 2010-10-21 12:58:18

ansaurus

tags:

views:

answers:

How to optimize a cycle?

related questions