I'm working on a piece of code and trying to optimize it as much as possible; the goal is to get it running under a certain time limit.
The following makes the call...
    static affinity_partitioner ap;
    parallel_for(blocked_range<size_t>(0, T), LoopBody(score), ap);
... and the following is what is executed.
    void operator()(const blocked_range<size_t> &r) const {
        int temp;
        int i;
        int j;
        size_t k;
        size_t begin = r.begin();
        size_t end = r.end();
        for(k = begin; k != end; ++k) { // for each trainee
            temp = 0;
            for(i = 0; i < N; ++i) { // for each sample
                int trr = trRating[k][i];
                int ei = E[i];
                for(j = 0; j < ei; ++j) { // for each expert
                    temp += delta(i, trr, exRating[j][i]);
                }
            }
            myscore[k] = temp;
        }
    }
I'm using Intel's TBB to optimize this, but I've also been reading about SIMD, SSE2, and things of that nature. So my question is: how do I store the variables (i, j, k) in registers so that the CPU can access them faster? I think the answer has to do with using SSE2 or some variation of it, but I have no idea how to do that. Any ideas?
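For illustration, here is a minimal sketch of what an SSE2 version of the innermost (per-expert) loop could look like. It rests on two assumptions that are not stated anywhere above: that delta() is simply the absolute difference of the two ratings (the real delta also takes i and may do something else), and that the ratings of all experts for one sample are stored contiguously, whereas exRating[j][i] with j as the inner index is a strided access. The names sum_delta_sse2 and delta_scalar are made up for this example.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for delta(): absolute difference of two ratings.
static inline int delta_scalar(int a, int b) {
    return a > b ? a - b : b - a;
}

// Sum of |trr - ex[j]| for j in [0, n), processing four experts per iteration.
// Assumption: ex points at n contiguous ints (all expert ratings for one sample).
int sum_delta_sse2(int trr, const int* ex, std::size_t n) {
    assert(ex != 0 || n == 0);

    const __m128i t = _mm_set1_epi32(trr);  // broadcast trainee rating to 4 lanes
    __m128i acc = _mm_setzero_si128();      // 4 partial sums

    std::size_t j = 0;
    for (; j + 4 <= n; j += 4) {
        // Unaligned load of four expert ratings.
        __m128i e = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ex + j));
        __m128i d = _mm_sub_epi32(t, e);
        // SSE2 has no 32-bit abs, so use the sign-mask trick:
        // sign is all ones for negative lanes, zero otherwise.
        __m128i sign = _mm_srai_epi32(d, 31);
        __m128i absd = _mm_sub_epi32(_mm_xor_si128(d, sign), sign);
        acc = _mm_add_epi32(acc, absd);
    }

    // Horizontal sum of the four partial sums.
    int lanes[4];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), acc);
    int total = lanes[0] + lanes[1] + lanes[2] + lanes[3];

    // Scalar tail for the remaining 0..3 experts.
    for (; j < n; ++j) {
        total += delta_scalar(trr, ex[j]);
    }
    return total;
}
```

With this layout, the body of the per-sample loop would become something like temp += sum_delta_sse2(trr, ratings_for_sample_i, E[i]), where ratings_for_sample_i is again a hypothetical pointer into a transposed copy of exRating.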
Edit: This will be run on a Linux box, using Intel's compiler (I believe). If it helps, I have to run the following commands before I do anything, to make sure the compiler works:
    source /opt/intel/Compiler/11.1/064/bin/intel64/iccvars_intel64.csh
    source /opt/intel/tbb/2.2/bin/intel64/tbbvars.csh
Then, to compile, I do:
    icc -ltbb test.cxx -o test
If there's no easy way to implement SSE2, any advice on how to further optimize the code?
Thanks, Hristo