ansaurus

Question

High performance comparison of signed int arrays (using Intel IPP library)

Answer 1

+1 A:

I thought there was an SSE instruction that compares integers. Have you looked into the intrinsics that can do that?
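As a rough illustration of what such an intrinsic looks like, the sketch below compares four signed 32-bit integers at once with _mm_cmpeq_epi32 (the PCMPEQD instruction) and collapses the result into a 4-bit mask. The function name equal_mask4 is made up for this example, and it assumes an x86 target with SSE2. For ordered comparisons, _mm_cmpgt_epi32 (PCMPGTD) does the signed greater-than test in the same way.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: returns a 4-bit mask, bit i is set when a[i] == b[i]. */
int equal_mask4(const int *a, const int *b) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);  /* unaligned load of 4 ints */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i eq = _mm_cmpeq_epi32(va, vb);              /* lane is all ones where equal */
    return _mm_movemask_ps(_mm_castsi128_ps(eq));      /* gather the top bit of each lane */
}
```

Calling it on {1, -2, 3, 4} and {1, 2, 3, -4} sets bits 0 and 2 only, because only the first and third elements match.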

Calyth 2009-10-16 20:49:57

Answer 2

+1 A:

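You could do the comparison with PCMPEQD followed by PACKSSDW and PACKSSWB. (The signed-saturating packs are the ones you want here: the unsigned-saturating PACKUSDW/PACKUSWB would turn the all-ones "equal" mask, which reads as -1, into 0.) It would be something along these lines: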

#include <emmintrin.h>  // SSE2 intrinsics

// Compares count 32-bit integers; writes one byte per element (0xFF if equal, 0x00 if not).
void cmp(const __m128i* a, const __m128i* b, __m128i* result, unsigned count) {
    for (unsigned i = 0; i < count / 16; ++i) {
        __m128i result0 = _mm_cmpeq_epi32(a[0], b[0]);  // each line compares 4 integers
        __m128i result1 = _mm_cmpeq_epi32(a[1], b[1]);
        __m128i result2 = _mm_cmpeq_epi32(a[2], b[2]);
        __m128i result3 = _mm_cmpeq_epi32(a[3], b[3]);
        a += 4; b += 4;

        __m128i wresult0 = _mm_packs_epi32(result0, result1);  // pack 2*4 integer results into 8 words
        __m128i wresult1 = _mm_packs_epi32(result2, result3);

        *result = _mm_packs_epi16(wresult0, wresult1);  // pack 2*8 word results into 16 bytes
        result++;
    }
}

Needs 16-byte-aligned pointers and a count divisible by 16; the caller has to cast its int pointers to __m128i*, and it probably needs some debugging, of course. The packssdw/packsswb steps are available as the SSE2 intrinsics _mm_packs_epi32 and _mm_packs_epi16, so no compiler-specific builtins are needed.
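If the alignment and divisible-by-16 requirements are inconvenient, one possible variant is sketched below. It is only an illustration under assumptions: the name cmp_eq_bytes is made up, it takes plain int pointers, uses unaligned loads and stores (which may be somewhat slower on older CPUs), and finishes any leftover elements with a scalar loop.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical variant: out[i] = 0xFF where a[i] == b[i], else 0x00.
   No alignment requirement; count may be any value. */
void cmp_eq_bytes(const int *a, const int *b, unsigned char *out, size_t count) {
    size_t i = 0;
    for (; i + 16 <= count; i += 16) {
        /* four compares of 4 integers each */
        __m128i r0 = _mm_cmpeq_epi32(_mm_loadu_si128((const __m128i *)(a + i)),
                                     _mm_loadu_si128((const __m128i *)(b + i)));
        __m128i r1 = _mm_cmpeq_epi32(_mm_loadu_si128((const __m128i *)(a + i + 4)),
                                     _mm_loadu_si128((const __m128i *)(b + i + 4)));
        __m128i r2 = _mm_cmpeq_epi32(_mm_loadu_si128((const __m128i *)(a + i + 8)),
                                     _mm_loadu_si128((const __m128i *)(b + i + 8)));
        __m128i r3 = _mm_cmpeq_epi32(_mm_loadu_si128((const __m128i *)(a + i + 12)),
                                     _mm_loadu_si128((const __m128i *)(b + i + 12)));
        __m128i w0 = _mm_packs_epi32(r0, r1);  /* 8 dword masks -> 8 words */
        __m128i w1 = _mm_packs_epi32(r2, r3);
        /* 16 word masks -> 16 bytes, stored in element order */
        _mm_storeu_si128((__m128i *)(out + i), _mm_packs_epi16(w0, w1));
    }
    for (; i < count; ++i)  /* scalar tail for the last count % 16 elements */
        out[i] = (a[i] == b[i]) ? 0xFF : 0x00;
}
```

The scalar tail also doubles as a reference implementation when checking the vector path against known inputs.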

drhirsch 2009-10-16 20:54:49

I'm checking the MMX operations now and this seems a good way of accomplishing what I intended. I'm just unsure about multi-core use in this case: it seems it's not done "automagically", right?

Chuim 2009-10-19 12:48:15

No. And since both cores share some resources, such as the last level of cache and the memory, memory bandwidth is likely to become the bottleneck. For a routine this simple it is probably not worth the effort, and performance could even get worse for various reasons.

drhirsch 2009-10-19 15:22:22

Answer 3

A:

Backing out of the box for a bit: are you sure this is a performance problem? Unless your data set fits in L1 cache, you will be cache-fill limited and the actual cycles you're spending on your comparison operations (which are hardly slow even when done in the most naive way possible) can't possibly be limiting.

Andy Ross 2009-10-16 22:03:14

You are right, for pure memory operations memory bandwidth is usually the limiting factor. Nevertheless, even for a simple memory copy the SSE instructions will outperform the "naive" way or string operations, if only by a small margin. Only a quarter of the execution units are occupied compared to the simple way, so hyperthreading possibly benefits a lot from the vector operations. Additionally, he may choose to bypass the caches (non-temporal move instructions) if he has streaming data, to avoid cache pollution.
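To make the cache-bypassing idea concrete, here is a minimal sketch that writes the comparison results with the non-temporal store _mm_stream_si128 (the MOVNTDQ instruction). The function name cmp_eq_stream is made up, and to keep the example short it writes full 32-bit masks rather than packed bytes. It assumes the destination is 16-byte aligned and the count is a multiple of 4. Whether this actually helps depends on the workload (it only pays off if the results are not read again soon), so it should be measured.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: out[i] = -1 (all ones) where a[i] == b[i], else 0.
   Results are written with non-temporal stores that bypass the caches. */
void cmp_eq_stream(const int *a, const int *b, int *out, size_t count) {
    assert(((uintptr_t)out & 15) == 0);  /* MOVNTDQ needs a 16-byte-aligned address */
    assert(count % 4 == 0);
    for (size_t i = 0; i < count; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_stream_si128((__m128i *)(out + i), _mm_cmpeq_epi32(va, vb));
    }
    _mm_sfence();  /* make the streamed stores visible before returning */
}
```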

drhirsch 2009-10-17 01:56:17

I'll pay close attention to that too; thanks for the advice. Anyway, we'll run comparison tests between the two versions of each operation we're building, using IPP and simple scalar operations, to be sure about real performance gains.

Chuim 2009-10-19 12:31:12
