ansaurus

Question

Answer 1

+1 A:

Note that your example decides what to execute at compile time (since you're using the preprocessor). In that case you can use more sophisticated techniques to decide what you actually want to execute, for example tag dispatching: http://cplusplus.co.il/2010/01/03/tag-dispatching/ Following the example shown there, the fast implementation could use SIMD and the slow one could do without it.
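As a rough illustration of the idea (not the code from the linked article), tag dispatching for this case might look like the sketch below. The names simd_tag, scalar_tag, impl_tag, add_impl and add are all made up for this example, and the SSE path is guarded so that the sketch still compiles where SSE is unavailable.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

// Hypothetical tag types: empty structs used only to select an overload.
struct simd_tag {};
struct scalar_tag {};

// Hypothetical trait mapping a compile-time flag to a tag type.
template <bool UseSimd> struct impl_tag       { typedef scalar_tag type; };
template <>             struct impl_tag<true> { typedef simd_tag   type; };

// "Slow" implementation: one element per iteration.
inline void add_impl(const float* a, const float* b, float* out,
                     std::size_t n, scalar_tag)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

// "Fast" implementation: four floats per iteration when SSE is available.
inline void add_impl(const float* a, const float* b, float* out,
                     std::size_t n, simd_tag)
{
    std::size_t i = 0;
#if defined(__SSE__)
    // Unaligned loads/stores keep the sketch free of alignment requirements.
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(out + i,
                      _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
#endif
    // Leftover elements (or all of them when SSE is not available).
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}

// Caller-facing function: the tag object selects the overload at compile time.
template <bool UseSimd>
inline void add(const float* a, const float* b, float* out, std::size_t n)
{
    assert(a && b && out);
    add_impl(a, b, out, n, typename impl_tag<UseSimd>::type());
}
```

Calling add<true>(a, b, out, n) and add<false>(a, b, out, n) on the same arrays should give the same result; the choice is still fixed at compile time, and both implementations still have to be written.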

rmn 2010-01-23 08:16:15

There are three problems: 1. It makes the decision at compile time. 2. It requires multiple implementations. 3. The decision is based on the input data, whereas I want the same data to be used in both the SIMD and non-SIMD paths.

Aleks 2010-01-23 08:43:13

Answer 2

+2 A:

You might want to look at the source for the MacSTL library for some ideas in this area: www.pixelglow.com/macstl/

Paul R 2010-01-23 08:41:34

There are a lot of templates in MacSTL, so I'll need time to figure out how it is implemented. But while reading about it I came up with an idea which seems to work. I'll post some dirty code as a new answer in case someone else is curious...

Aleks 2010-01-23 09:51:02

I'd be interested to see any code that you come up with. I already have a partial solution for this kind of problem but unfortunately it's proprietary (the IP belongs to my employer).

Paul R 2010-01-23 18:49:54

Answer 3

+2 A:

If anyone is interested, this is the dirty code I came up with to test a new idea I had while reading about the library that Paul posted.

Thanks Paul!

// This is just a conceptual test
// I haven't profiled the code and I haven't verified that the result is correct
#include <xmmintrin.h>


// This class is doing all the math
template <bool SIMD>
class cStreamF32
{
private:
    void*       m_data;
    void*       m_dataEnd;
    __m128*     m_current128;
    float*      m_current32;

public:
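    cStreamF32(int size)
    {
        // size is the number of floats (a multiple of 4 for the SIMD case)
        if (SIMD)
            m_data = _mm_malloc(sizeof(float) * size, 16);
        else
            m_data = new float[size];
        // One past the last element; Next() compares against this
        m_dataEnd = (char*)m_data + sizeof(float) * size;
    }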
    ~cStreamF32()
    {
        if (SIMD)
            _mm_free(m_data);
        else
            delete[] (float*)m_data;
    }

    inline void Begin()
    {
        if (SIMD)
            m_current128 = (__m128*)m_data;
        else
            m_current32 = (float*)m_data;
    }

    inline bool Next()
    {
        if (SIMD)
        {
            m_current128++;
            return m_current128 < m_dataEnd;
        }
        else
        {
            m_current32++;
            return m_current32 < m_dataEnd;
        }
    }

    inline void operator=(const __m128 x)
    {
        *m_current128 = x;
    }
    inline void operator=(const float x)
    {
        *m_current32 = x;
    }

    inline __m128 operator+(const cStreamF32<true>& x)
    {
        // Packed add: all four lanes (_mm_add_ss only adds the lowest one)
        return _mm_add_ps(*m_current128, *x.m_current128);
    }
    inline float operator+(const cStreamF32<false>& x)
    {
        return *m_current32 + *x.m_current32;
    }

    inline __m128 operator+(const __m128 x)
    {
        return _mm_add_ps(*m_current128, x);
    }
    inline float operator+(const float x)
    {
        return *m_current32 + x;
    }

    inline __m128 operator*(const cStreamF32<true>& x)
    {
        // Packed multiply: all four lanes (_mm_mul_ss only multiplies the lowest one)
        return _mm_mul_ps(*m_current128, *x.m_current128);
    }
    inline float operator*(const cStreamF32<false>& x)
    {
        return *m_current32 * *x.m_current32;
    }

    inline __m128 operator*(const __m128 x)
    {
        return _mm_mul_ps(*m_current128, x);
    }
    inline float operator*(const float x)
    {
        return *m_current32 * x;
    }
};

// Executes both functors
template<class T1, class T2>
void Execute(T1& functor1, T2& functor2)
{
    functor1.Begin();
    do
    {
        functor1.Exec();
    }
    while (functor1.Next());

    functor2.Begin();
    do
    {
        functor2.Exec();
    }
    while (functor2.Next());
}

// This is the implementation of the problem
template <bool SIMD>
class cTestFunctor
{
private:
    cStreamF32<SIMD> a;
    cStreamF32<SIMD> b;
    cStreamF32<SIMD> c;

public:
    cTestFunctor() : a(1024), b(1024), c(1024) { }

    inline void Exec()
    {
        c = a + b * a;
    }

    inline void Begin()
    {
        a.Begin();
        b.Begin();
        c.Begin();
    }

    inline bool Next()
    {
        a.Next();
        b.Next();
        return c.Next();
    }
};


int main (int argc, char * const argv[]) 
{
    cTestFunctor<true> functor1;
    cTestFunctor<false> functor2;

    Execute(functor1, functor2);

    return 0;
}

Aleks 2010-01-23 10:00:30

Answer 4

+2 A:

You might want to take a glance at my attempt at SIMD/non-SIMD:

vrep, a templated base class with specializations for SIMD (note how it distinguishes between float-only SSE and SSE2, which introduced integer vectors).
The more useful v4f, v4i etc. classes (subclassed via an intermediate v4).
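For readers who have not seen the pattern, here is a minimal sketch of a representation class that is specialized for SIMD. This is not the vrep code itself; vec4_rep and madd4 are hypothetical names, and the SSE specialization is only compiled when __SSE__ is defined.

```cpp
#include <cassert>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

// Primary template: four floats in a plain array, scalar arithmetic.
template <bool UseSimd>
struct vec4_rep
{
    float v[4];

    static vec4_rep load(const float* p)
    {
        vec4_rep r;
        for (int i = 0; i < 4; ++i) r.v[i] = p[i];
        return r;
    }
    void store(float* p) const
    {
        for (int i = 0; i < 4; ++i) p[i] = v[i];
    }
    vec4_rep operator+(const vec4_rep& o) const
    {
        vec4_rep r;
        for (int i = 0; i < 4; ++i) r.v[i] = v[i] + o.v[i];
        return r;
    }
    vec4_rep operator*(const vec4_rep& o) const
    {
        vec4_rep r;
        for (int i = 0; i < 4; ++i) r.v[i] = v[i] * o.v[i];
        return r;
    }
};

#if defined(__SSE__)
// Specialization: same interface, but backed by one SSE register.
template <>
struct vec4_rep<true>
{
    __m128 v;

    static vec4_rep load(const float* p)
    {
        vec4_rep r;
        r.v = _mm_loadu_ps(p);
        return r;
    }
    void store(float* p) const { _mm_storeu_ps(p, v); }
    vec4_rep operator+(const vec4_rep& o) const
    {
        vec4_rep r;
        r.v = _mm_add_ps(v, o.v);
        return r;
    }
    vec4_rep operator*(const vec4_rep& o) const
    {
        vec4_rep r;
        r.v = _mm_mul_ps(v, o.v);
        return r;
    }
};
#endif

// Generic code written once against the shared interface: out = a + b * a.
template <bool UseSimd>
void madd4(const float* a, const float* b, float* out)
{
    assert(a && b && out);
    typedef vec4_rep<UseSimd> V;
    V va = V::load(a);
    V vb = V::load(b);
    (va + vb * va).store(out);
}
```

The expression in madd4 is written once, and the template argument decides which representation does the work.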

Of course it's far more geared towards 4-element vectors for rgba/xyz type calculations than SoA, so will completely run out of steam when 8-way AVX comes along, but the general principles might be useful.

timday 2010-01-23 14:28:52

It's an interesting approach, but I really need SoA in my case. I might try template specialization as well, though.

Aleks 2010-01-23 17:40:23

Answer 5

A:

The most impressive approach to SIMD-scaling I've seen is the RTFact ray-tracing framework: slides, paper. Well worth a look. The researchers are closely associated with Intel (Saarbrucken now hosts the Intel Visual Computing Institute) so you can be sure forward scaling onto AVX and Larrabee was on their minds.

Intel's Ct "data parallelism" template library looks quite promising too.

timday 2010-01-23 14:39:22

Answer 6

A:

Have you thought about using existing solutions like liboil? It implements lots of common SIMD operations and can decide at runtime whether to use SIMD/non-SIMD code (using function pointers assigned by an initialization function).
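To show what runtime selection through a function pointer can look like in general, here is a small sketch. It does not use liboil's actual API; add_f32, add_f32_scalar, add_f32_sse and init_dispatch are hypothetical names, and the CPU check is left to the caller (init_dispatch just takes a bool).

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

// Signature shared by every implementation of the operation.
typedef void (*add_f32_fn)(const float*, const float*, float*, std::size_t);

// Plain C++ version.
static void add_f32_scalar(const float* a, const float* b, float* out,
                           std::size_t n)
{
    assert(n == 0 || (a && b && out));
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

// SSE version: four floats per iteration, scalar loop for the remainder.
static void add_f32_sse(const float* a, const float* b, float* out,
                        std::size_t n)
{
    assert(n == 0 || (a && b && out));
    std::size_t i = 0;
#if defined(__SSE__)
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(out + i,
                      _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
#endif
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}

// The pointer callers go through; it defaults to the scalar version.
static add_f32_fn add_f32 = add_f32_scalar;

// Hypothetical initialization function: call once at startup with the
// result of whatever CPU feature detection the program uses.
void init_dispatch(bool cpu_has_sse)
{
    add_f32 = cpu_has_sse ? add_f32_sse : add_f32_scalar;
}
```

After init_dispatch(...) has run, callers simply write add_f32(a, b, out, n). Every call is an indirect call, so the compiler generally cannot inline it.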

AndiDog 2010-01-23 14:57:27

I still need to check how this library uses function pointers to switch between SIMD and non-SIMD code, but I don't see how calls through those function pointers can be inlined. The other problem I noticed is that every arithmetic function is implemented with its own loop: for(i=0;i<(n...);i+=4){ ... } for(;i<n;i++){ ... } So for something like A + B + C it needs to loop through all the elements twice.
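A small sketch of the difference, using made-up function names (add_array, sum3_two_pass, sum3_fused) rather than anything from liboil:

```cpp
#include <cassert>
#include <cstddef>

// Library-style primitive: every call owns its loop.
void add_array(const float* x, const float* y, float* out, std::size_t n)
{
    assert(n == 0 || (x && y && out));
    for (std::size_t i = 0; i < n; ++i)
        out[i] = x[i] + y[i];
}

// A + B + C built from the primitive: two full passes over the data.
void sum3_two_pass(const float* a, const float* b, const float* c,
                   float* out, std::size_t n)
{
    add_array(a, b, out, n);    // pass 1: out = A + B
    add_array(out, c, out, n);  // pass 2: out = out + C
}

// The same expression fused into a single loop: one pass, and the
// intermediate A + B never has to be written back to memory.
void sum3_fused(const float* a, const float* b, const float* c,
                float* out, std::size_t n)
{
    assert(n == 0 || (a && b && c && out));
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i] + c[i];
}
```

Both functions compute the same values; the fused version just touches each element once, which is what a single loop over the whole expression is meant to achieve.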

Aleks 2010-01-23 17:37:55

SIMD or not SIMD - cross platform