I'd like to optimize the following snippet using SSE instructions if possible:

/*
 * the data structure
 */
typedef struct v3d v3d;
struct v3d {
    double x;
    double y;
    double z;
} tmp = { 1.0, 2.0, 3.0 };

/*
 * the part that should be "optimized"
 */
tmp.x /= 4.0;
tmp.y /= 4.0;
tmp.z /= 4.0;

Is this possible at all?

A: 

Is tmp.x *= 0.25; enough?

Note that for SSE instructions (in case that you want to use them) it's important that:

1) all memory accesses are 16-byte aligned

2) the operations are performed in a loop

3) no int <-> float or float <-> double conversions are performed

4) avoid divisions if possible
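
As a rough sketch of how the four points above fit together, assuming SSE2 and a GCC-compatible compiler: `scale_array` and its parameters are made-up names for illustration, and the caller is assumed to pass a 16-byte-aligned buffer with an even number of elements.

```c
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: divides n doubles by d in place.
 * Assumes data is 16-byte aligned and n is even. */
void scale_array(double *data, size_t n, double d)
{
    /* One scalar division up front; the loop itself only multiplies.
     * Note: x * (1.0 / d) can differ from x / d in the last bit. */
    const __m128d inv = _mm_set1_pd(1.0 / d);

    for (size_t i = 0; i < n; i += 2) {
        __m128d v = _mm_load_pd(&data[i]);  /* aligned load of 2 doubles */
        v = _mm_mul_pd(v, inv);             /* no division in the loop */
        _mm_store_pd(&data[i], v);          /* aligned store */
    }
}
```

Whether this pays off depends on how many elements are processed; for a single three-component vector the setup cost may well dominate.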

ruslik
No. In my application 4.0 will be replaced by a variable.
guest
Why do I need a loop? So that it pays off?
guest
Anyway, you'll have to post more of your code, not just a line with a division, if you need help.
ruslik
Sometimes avoiding division is incorrect. For instance, if the number were 5 instead of 4, multiplying by 0.2 instead of dividing by 5.0 is incorrect (it will produce blatantly wrong results) because there is no such floating point number as 0.2 (the closest double to 0.2 is slightly greater, i.e. 0.2000000000000000111...).
R..
@R you're right. But considering the amount of details `guest` gave it could be a viable option.
ruslik
R: Is there any instance where dividing will give a different answer than multiplying by the reciprocal?
Gabe
@Gabe: (even though I'm not R) For floating point calculations the answer to that question is almost always (probably except when dividing by 2^n). However, whether or not the difference between those two results matters is a totally different story. Multiplying by 0.2 instead of dividing by 5.0 will produce a different result, but I wouldn't call it blatantly wrong. More to the point, I couldn't really say offhand which of those is closer to the correct result (0.2 has maybe 1 ulp more error, but mul generally has a smaller error than div, so it's about the same).
Grizzly
Grizzly: Considering that most commercial FPUs implement division as multiplication by the reciprocal, it will be rare to find a situation where division gives you a different answer than multiplying by the precomputed reciprocal. On my Intel CPU, 12.0/5.0 gives me 2.3999999999999999 while 12.0*0.2 gives me 2.4000000000000004 (1 ulp difference).
Gabe
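
The comparison discussed in these comments can be checked with a small sketch like the one below. `reciprocal_matches_division` is a hypothetical name, and the result assumes IEEE 754 double arithmetic with round-to-nearest (as with SSE2 on x86-64) and no `-ffast-math`.

```c
/* Hypothetical helper: returns 1 if x / d and x * (1.0 / d) produce
 * bit-identical doubles, 0 otherwise. The volatile stores force both
 * results to be rounded to double before they are compared. */
int reciprocal_matches_division(double x, double d)
{
    volatile double quotient = x / d;
    volatile double product  = x * (1.0 / d);
    return quotient == product;
}
```

With the values from the comment above (12.0 and 5.0) the two results differ by one unit in the last place, so the function returns 0; with a divisor of 4.0 the reciprocal 0.25 is exactly representable and it returns 1.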
+2  A: 

I've used SIMD extensions under Windows, but not yet under Linux. That being said, you should be able to take advantage of the DIVPS SSE instruction, which divides a vector of 4 floats by another vector of 4 floats. But you are using doubles, so you'll want the SSE2 version, DIVPD, which works on 2 doubles at a time. I almost forgot: make sure to build with the -msse2 switch.

I found a page which details some SSE GCC builtins. It looks kind of old, but should be a good start.

http://ds9a.nl/gcc-simd/

jay.lee
I can't seem to find the right GCC builtin for DIVPD, though.
guest
Here's a comprehensive list straight from the GCC docs: http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/X86-Built_002din-Functions.html
jay.lee
v2df __builtin_ia32_divpd (v2df, v2df), seems to be what I was looking for. thanks.
guest
I recommend you use the intrinsics, _not_ the builtins, as they're more portable and effectively deprecate the SIMD builtins: http://www.codeproject.com/KB/recipes/sseintro.aspx
Matt Joiner
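
For a divisor that is only known at run time, a minimal sketch with the SSE2 intrinsics from `<emmintrin.h>` might look like this. `v3d_div` is a made-up helper name, and the code assumes `x` and `y` sit next to each other in memory, which holds for three consecutive `double` members on common ABIs.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

typedef struct {
    double x;
    double y;
    double z;
} v3d;

/* Hypothetical helper: divides all three components of *v by d. */
void v3d_div(v3d *v, double d)
{
    const __m128d vd = _mm_set1_pd(d);  /* { d, d } */
    double *p = (double *)v;            /* p[0] = x, p[1] = y, p[2] = z */

    /* x and y: unaligned load/store, so *v needs no special alignment */
    __m128d xy = _mm_loadu_pd(p);
    _mm_storeu_pd(p, _mm_div_pd(xy, vd));

    /* z: scalar load, divide and store touch only the low lane */
    __m128d z = _mm_load_sd(p + 2);
    _mm_store_sd(p + 2, _mm_div_sd(z, vd));
}
```

Calling `v3d_div(&tmp, 4.0)` on the `tmp` from the question would leave it as { 0.25, 0.5, 0.75 }. The unaligned loads trade a little speed for not having to change the layout of the struct.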
A: 

The intrinsic you are looking for is _mm_div_pd. Here is a working example which should be enough to steer you in the right direction:

#include <stdio.h>

#include <emmintrin.h>

typedef struct
{
    double x;
    double y;
    double z;
} v3d;

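/* Overlay the 24-byte struct with two 16-byte SSE vectors (GCC-style
   alignment attribute); v[1] holds z plus 8 bytes of padding. */
typedef union __attribute__ ((aligned(16)))
{
    v3d a;
    __m128d v[2];
} u3d;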

int main(void)
{
    const __m128d vd = _mm_set1_pd(4.0);
    u3d u = { { 1.0, 2.0, 3.0 } };

    printf("v (before) = { %g %g %g }\n", u.a.x, u.a.y, u.a.z);

    /* v[0] covers x and y; v[1] covers z and the unused padding lane */
    u.v[0] = _mm_div_pd(u.v[0], vd);
    u.v[1] = _mm_div_pd(u.v[1], vd);

    printf("v (after) = { %g %g %g }\n", u.a.x, u.a.y, u.a.z);

    return 0;
}
Paul R