views:

136

answers:

5

I'm wondering if simply compiling my msvc project with sse/sse2 will have any effect. I do for example vector normalization and dot product but I do these wth math, not any specific function. Is there like sse_dot() and sse_normalize() that I should be using to actualyy take advantage, or will the compiler know?

Thanks

+2  A: 

Or you can avoid having to write SSE stuff explicitly by using a high performance library like Eigen, BLAS, Intel MKL, ... Unless you are working on embedded system, these libraries will be much better than whatever you can come up with.

Dat Chu
A: 

If you enable SSE2, then the compiler will use it behind your back. You will never notice nor need to know, unless you intend on supporting CPUs without SSE2. This is the same as any other underlying CPU instruction.

DeadMG
Simply using the SSE2 code path over x87 is nowhere near being able to auto-vectorize code. Auto-vectorization is a very tricky topic, and compilers are very poor at it (when its supported at all)
Yann Ramin
+5  A: 

As I understand it, using the sse2 compiler option will result in the compiler using the scalar not vector sse2 instructions in place of normal fpu code. I don't think it will do any vectorisation. The sse2 scalar stuff is quicker than fpu for sure.

To use the vector unit you need to use either intrinsics directly ( xmmintrin.h ) or use 3rd party libs that do. If you're just doing simple vector/matrix stuff for rendering, the Bullet SDK has an sse optimised vector math lib that's not bad. IIRC the DirectX/XNAmath lib is sse optimised too.

If neither of those take your fancy, Google should turn up a number of alternatives.

Andy J Buchanan
As a language extension gcc has builtin vector types and pseudo-functions that are realized with sse instructions if available. This should be portable as far as gcc runs on an architecture and should also apply to derived compilers like icc.
Jens Gustedt
A: 

Not all compilers are as smart as you thought. Even gcc might not always optimize the most obvious code. See the following example and try by yourself. Icc seems to be able to optimize the inner loop, but gcc, as I tried several settings, cannot. When necessary, you have to manually call SSE/SSE2 instructions by using SSE functions. People told me this is a good tutorial.

EDIT: The following example works with Mac/Linux gcc. But it fails icc on linux. I do not know why. BTW, icc is believed to be better than gcc on vectorization.

#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <math.h>
#include <emmintrin.h>

float **mm_init(int n)
{
    float **m;
    int i;
    m = (float**)malloc(n * sizeof(void*));
    for (i = 0; i < n; ++i)
        m[i] = calloc(n, sizeof(float));
    return m;
}
void mm_destroy(int n, float **m)
{
    int i;
    for (i = 0; i < n; ++i) free(m[i]);
    free(m);
}
float **mm_gen(int n)
{
    float **m;
    int i, j;
    m = mm_init(n);
    for (i = 0; i < n; ++i)
        for (j = 0; j < n; ++j)
            m[i][j] = 2 * drand48() - 1.0;
    return m;
}
// better cache performance by transposing the second matrix
float **mm_mul2(int n, float *const *a, float *const *b)
{
    int i, j, k;
    float **m, **c;
    m = mm_init(n); c = mm_init(n);
    for (i = 0; i < n; ++i) // transpose
        for (j = 0; j < n; ++j)
            c[i][j] = b[j][i];
    for (i = 0; i < n; ++i) {
        float *p = a[i], *q = m[i];
        for (j = 0; j < n; ++j) {
            float t = 0.0, *r = c[j];
            for (k = 0; k < n; ++k)
                t += p[k] * r[k];
            q[j] = t;
        }
    }
    mm_destroy(n, c);
    return m;
}
// explicit SSE optimization for the inner loop
float **mm_mul3(int n, float *const *a, float *const *b)
{
    int i, j, k;
    float **m, **c, x[4];
    m = mm_init(n); c = mm_init(n);
    for (i = 0; i < n; ++i) // transpose
        for (j = 0; j < n; ++j)
            c[i][j] = b[j][i];
    for (i = 0; i < n; ++i) {
        float *p = a[i], *q = m[i];
        for (j = 0; j < n; ++j) {
            __m128 t = _mm_setzero_ps();
            float *r = c[j];
            for (k = 0; k < n; k += 4) // four operations in one CPU cycle
                t = _mm_add_ps(t, _mm_mul_ps(_mm_load_ps(p+k), _mm_load_ps(r+k)));
            _mm_store_ps(x, t);
            q[j] = x[0] + x[1] + x[2] + x[3];
        }
    }
    mm_destroy(n, c);
    return m;
}

int main(int argc, char *argv[])
{
    int n = 100;
    float **a, **b, **m;
    clock_t t;
    if (argc > 1) n = atoi(argv[1]);
    n = (n + 3) / 4 * 4; // for simplicity, n can be divided by 4
    srand48(11);
    a = mm_gen(n); b = mm_gen(n);

    t = clock();
    m = mm_mul2(n, a, b);
    fprintf(stderr, "cache:  %lf sec; M[%d][%d]=%f\n", (double)(clock() - t) / CLOCKS_PER_SEC, n/2, n/2, m[n/2][n/2]);

    t = clock();
    m = mm_mul3(n, a, b);
    fprintf(stderr, "SSE:    %lf sec; M[%d][%d]=%f\n", (double)(clock() - t) / CLOCKS_PER_SEC, n/2, n/2, m[n/2][n/2]);

    mm_destroy(n, a); mm_destroy(n, b); mm_destroy(n, m);
    return 0;
}
+1  A: 

Either write SSE code yourself (asm or intrinsics), use third party SSE-optimised libraries (e.g. IPP, MKL, etc), or use an auto-vectorizing compiler such as Intel's ICC.

Paul R