In trying to figure out whether or not my code's inner loop is hitting a hardware design barrier or a lack of understanding on my part barrier. There's a bit more to it, but the simplest question I can come up with to answer is as follows:
If I have the following code:
float px[32768],py[32768],pz[32768];
float xref, yref, zref, deltax, deltay, deltaz;
initialize_with_random(px);
initialize_with_random(py);
initialize_with_random(pz);
for(i=0;i<32768-1;i++) {
xref=px[i];
yref=py[i];
zref=pz[i];
for(j=0;j<32768-1;j++ {
deltx=xref-px[j];
delty=yref-py[j];
deltz=zref-pz[j];
} }
What type of maximum theoretical speed up would I be able to see by going to SSE instructions in a situation where I have complete control over code (assembly, intrinsics, whatever) but no control over runtime environment other than architecture (i.e. it's a multi-user environment so I can't do anything about how the OS kernel assigns time to my particular process).
Right now I'm seeing a speed up of 3x with my code, when I would have thought using SSE would give me much more vector depth than the 3x speed up is indicating (presumably the 3x speed up tells me I have a 4x maximum theoretical throughput). (I've tried things such as letting deltx/delty/deltz be arrays in case the compiler wasn't smart enough to auto-promote them, but I still see only 3x speed up.) I'm using the intel C compiler with the appropriate compiler flags for vectorization, but no intrinsics obviously.