I have some code that runs fairly well, but I would like to make it run better. The major problem I have with it is that it needs to have a nested for loop. The outer one is for iterations (which must happen serially), and the inner one is for each point particle under consideration. I know there's not much I can do about the outer one, but I'm wondering if there is a way of optimizing something like:
void collide(particle particles[], box boxes[],
double boxShiftX, double boxShiftY) {/*{{{*/
int i;
double nX;
double nY;
int boxnum;
for(i=0;i<PART_COUNT;i++) {
boxnum = ((((int)(particles[i].sX+boxShiftX))/BOX_SIZE)%BWIDTH+
BWIDTH*((((int)(particles[i].sY+boxShiftY))/BOX_SIZE)%BHEIGHT));
//copied and pasted the macro which is why it's kinda odd looking
particles[i].vX -= boxes[boxnum].mX;
particles[i].vY -= boxes[boxnum].mY;
if(boxes[boxnum].rotDir == 1) {
nX = particles[i].vX*Wxx+particles[i].vY*Wxy;
nY = particles[i].vX*Wyx+particles[i].vY*Wyy;
} else { //to make it randomly pick a rot. direction
nX = particles[i].vX*Wxx-particles[i].vY*Wxy;
nY = -particles[i].vX*Wyx+particles[i].vY*Wyy;
}
particles[i].vX = nX + boxes[boxnum].mX;
particles[i].vY = nY + boxes[boxnum].mY;
}
}/*}}}*/
I've looked at SIMD, though I can't find much about it, and I'm not entirely sure that the processing required to properly extract and pack the data would be worth the gain of doing half as many instructions, since apparently only two doubles can be used at a time.
I tried breaking it up into multiple threads with shm and pthread_barrier (to synchronize the different stages, of which the above code is one), but it just made it slower.
My current code does go pretty quickly; it's on the order of one second per 10M particle*iterations, and from what I can tell from gprof, 30% of my time is spent in that function alone (5000 calls; PART_COUNT=8192 particles took 1.8 seconds). I'm not worried about small, constant time things, it's just that 512K particles * 50K iterations * 1000 experiments took more than a week last time.
I guess my question is if there is any way of dealing with these long vectors that is more efficient than just looping through them. I feel like there should be, but I can't find it.