ansaurus

Question

Is SIMD Worth It? Is there a better option?

Answer 1

+1 A:

Do you have sufficient profiling to tell you where the time is spent within that function?

For instance, are you sure it's not your divs and mods in the boxnum calculation where the time is being spent? Sometimes compilers fail to spot possible shift/AND alternatives, even where a human (or at least, one who knew BOX_SIZE and BWIDTH/BHEIGHT, which I don't) might be able to.

It would be a pity to spend lots of time on SIMDifying the wrong bit of the code...

The other thing which might be worth looking into is if the work can be coerced into something which could work with a library like IPP, which will make well-informed decisions about how best to use the processor.

Will Dean 2010-07-18 19:02:41

Honestly, it probably *is* the divs and mods, but no; I have yet to find a profiler that will tell me that. For my current experiment, BOX_SIZE has been 1, and you have a good point: BWIDTH, BHEIGHT have been powers of two. Do you have a suggestion for a more fine-grained profiler?

zebediah49 2010-07-18 19:05:20

I would expect any sampling profiler to be able to give you per-line info, though of course the compiler optimisation makes line matching a little imprecise. Intel vTune will give you information even more finely-grained than a single assembler instruction, so that might be the way to go if that's what you feel you want to see.Personally, for something simple (i.e. small) like this I tend to time the code over lots of runs and then hack about with it to see what's taking the time.

Will Dean 2010-07-18 22:02:00

Answer 2

+2 A:

((int)(particles[i].sX+boxShiftX))/BOX_SIZE

That's expensive if sX is an int (can't tell). Truncate boxShiftX/Y to an int before entering the loop.

Hans Passant 2010-07-18 19:40:22

Unfortunately, both sX and boxShiftX are doubles, and the point of it is to effectively randomize rounding (boxShiftX is in the range [-.5,.5])

zebediah49 2010-07-18 20:55:01

I dunno, I usually go wtf when floating point numbers need to be truncated and taken modulo. That's a sign of an integer problem being obfuscated with perceived accuracy. Once you go there, turning the floating point numbers into integers by scaling usually pays off big. The end result of code like this tends to be integer, maybe a pixel on the screen. Integer results should have integer math. Sorry, I just don't know what you are really trying to do to be more helpful.

Hans Passant 2010-07-18 21:55:17

I have this set of particles, and am sorting them into 'boxes'. Due to a quirk of the simulation though, the location of the boxes has to jump around every timestep, which is why that happens.

zebediah49 2010-07-18 21:56:09

Answer 3

+3 A:

I'm not sure how much SIMD would benefit; the inner loop is pretty small and simple, so I'd guess (just by looking) that you're probably more memory-bound than anything else. With that in mind, I'd try rewriting the main part of the loop to not touch the particles array more than needed:

const double temp_vX = particles[i].vX - boxes[boxnum].mX;
const double temp_vY = particles[i].vY - boxes[boxnum].mY;

if(boxes[boxnum].rotDir == 1)
{
    nX = temp_vX*Wxx+temp_vY*Wxy;
    nY = temp_vX*Wyx+temp_vY*Wyy;
}
else
{
    //to make it randomly pick a rot. direction
    nX =  temp_vX*Wxx-temp_vY*Wxy;
    nY = -temp_vX*Wyx+temp_vY*Wyy;
}   
particles[i].vX = nX;
particles[i].vY = nY;

This has the small potential side effect of not doing the extra addition at the end.

Another potential speedup would be to use __restrict on the particle array, so that the compiler can better optimize the writes to the velocities. Also, if Wxx etc. are global variables, they may have to get reloaded each time through the loop instead of possibly stored in registers; using __restrict would help with that too.

Since you're accessing the particles in order, you can try prefetching (e.g. __builtin_prefetch on GCC) a few particles ahead to reduce cache misses. Prefetching on the boxes is a bit tougher since you're accessing them in an unpredictable order; you could try something like

int nextBoxnum = ((((int)(particles[i+1].sX+boxShiftX) /// etc...
// prefetch boxes[nextBoxnum]

One last one that I just noticed - if box::rotDir is always +/- 1.0, then you can eliminate the comparison and branch in the inner loop like this:

const double rot = boxes[boxnum].rotDir; // always +/- 1.0
nX =     particles[i].vX*Wxx + rot*particles[i].vY*Wxy;
nY = rot*particles[i].vX*Wyx +     particles[i].vY*Wyy;

Naturally, the usual caveats of profiling before and after apply. But I think all of these might help, and can be done regardless of whether or not you switch to SIMD.

celion 2010-07-18 21:49:29

Thanks for accepting my answer. How much did any of those help?

celion 2010-07-21 19:48:03

Answer 4

+1 A:

Just for the record, there's also libSIMDx86!

http://simdx86.sourceforge.net/Modules.html

(On compiling you may also try: gcc -O3 -msse2 or similar).

cigit 2010-07-19 08:33:21

ansaurus

tags:

views:

answers:

Is SIMD Worth It? Is there a better option?

related questions