views:

284

answers:

3

We're trying to compare two equally sized native arrays of signed int values using inequality operations, <, <=, > and >=, in a high performance way. As many values are compared, the true/false results would be sotred in a char array of the same size of the input, where 0x00 means false and 0xff means true.

To accomplish this, we're using the Intel IPP library. The problem is that the function we found that does this operation, named ippiCompare_*, from the images and video processing lib, supports only the types unsigned char (Ipp8u), signed/unsigned short (Ipp16s/Ipp16u) and float (Ipp32f). It does not directly support signed int (Ipp32s)

I (only) envision two possible ways of solving this:

  • Casting the array to one of the directly supported types and executing the comparison in more steps (it would became a short array of twice the size or a char array of four times the size) and merging intermediate results.

  • Using another function directly supporting signed int arrays from IPP or from another library that could do something equivalent in terms of performance.

But there may be other creative ways... So I ask you're help with that! :)

PS: The advantage of using Intel IPP is the performance gain for large arrays: it uses multi-value processor functions and many cores simultaneously (and maybe more tricks). So simple looped solutions wouldn't do it as fast AFAIK.

PS2: link for the ippiCompare_* doc

+1  A: 

I thought there is an SSE instruction that would compare integers. Have you look into the intrinsics that can do that?

Calyth
+1  A: 

You could do the comparison with PCMPEQD followed by a PACKUSDW and PACKUSWB. This would be something along

#include <emmintrin.h>

void cmp(__m128d* a, __m128d* b, v16qi* result, unsigned count) {
    for (unsigned i=0; i < count/16; ++i) {
        __m128d result0 = _mm_cmpeq_pd(a[0], b[0]);  // each line compares 4 integers
        __m128d result1 = _mm_cmpeq_pd(a[1], b[1]);
        __m128d result2 = _mm_cmpeq_pd(a[2], b[2]);
        __m128d result3 = _mm_cmpeq_pd(a[3], b[3]);
        a += 4; b+= 4;

        v8hi wresult0 = __builtin_ia32_packssdw(result0, result1);  //pack 2*4 integer results into 8 words
        v8hi wresult1 = __builtin_ia32_packssdw(result0, result1);

        *result = __builtin_ia32_packsswb(wresult0, wresult1);  //pack 2*8 word results into 16 bytes
        result++;
    }
}

Needs aligned pointers, a count divisible by 16, some typecasts I have omitted because of lazyness/stupidity and probably a lot of debugging, of course. And I didn't find the intrinsics for packssdw/wb, so I just used the builtins from my compiler.

drhirsch
I'm checking the MMX operations now and this seems a good way of accomplishing what I intended. I'm just unsure about multi-core use in this case: it seems it's not done "automagically", right?
Chuim
No. And since both cores share a part of the resources, like the last level of the cache and the memory, the memory bandwith is likely to become a bottleneck. For a routine so simple it is probably not worth the effort. Or worse, the performance could suffer for various reasons.
drhirsch
A: 

Backing out of the box for a bit: are you sure this is a performance problem? Unless your data set fits in L1 cache, you will be cache-fill limited and the actual cycles you're spending on your comparison operations (which are hardly slow even when done in the most naive way possible) can't possibly be limiting.

Andy Ross
You are right, for pure memory operations memory bandwith is usually the limiting factor. Nevertheless, even for a simple memory copy the sse instructions will outperform the "naive" way or string operations, if only by a small margin.Only a quarter of the execution units are occupied compared to the simple way, so hyperthreading possibly benefits a lot from the vector operations. Additionally he may choose to bypass the caches (non temporal mov instructions) if he has streaming data to avoid cache pollution.
drhirsch
I'll pay close attention to that too; thanks for the advice. Anyway, we'll run comparison tests between the two versions of each operation we're building, using IPP and simple scalar operations, to be sure about real performance gains.
Chuim