Hello.
I have run into a curious problem. The algorithm I am working on consists of lots of computations like this:
q = x(0)*y(0)*z(0) + x(1)*y(1)*z(1) + ...
where the length of the summation is between 4 and 7.
The original computations are all done in 64-bit precision. As an experiment, I tried using 32-bit precision for the x, y, z input values (so that the computations are performed in 32-bit) and storing the final result as a 64-bit value (a straightforward cast).
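To make the comparison concrete, here is a minimal sketch of the two variants as I understand them; it assumes plain C arrays of length n (4 to 7), and the function names are made up for illustration:

double triple_dot_f64(const double* x, const double* y, const double* z, std::size_t n)
{
    double q = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        q += x[i] * y[i] * z[i];      // all arithmetic in 64-bit
    return q;
}

double triple_dot_f32(const float* x, const float* y, const float* z, std::size_t n)
{
    float q = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        q += x[i] * y[i] * z[i];      // all arithmetic in 32-bit
    return static_cast<double>(q);    // final result stored as a 64-bit value
}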
I expected the 32-bit version to perform better (smaller cache footprint, wider SIMD, etc.), but to my surprise there was no difference in performance, perhaps even a slight decrease.
The architecture in question is Intel 64, Linux, g++. Both builds appear to use SSE, and the arrays in both cases are aligned to a 16-byte boundary.
Why would that be? My guess so far is that with 32-bit precision SSE can only be applied to the first 4 elements, with the remaining ones handled serially, compounded by the overhead of the final cast.
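To illustrate what I mean: a 128-bit SSE register holds 4 floats (versus 2 doubles), so for the 4-element case the whole product could in principle be done in one vector multiply plus a horizontal sum, roughly like the sketch below (my assumption about what the compiler might emit, written with plain SSE intrinsics, not code from my actual program):

#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical 4-element case: one 128-bit register holds all four floats.
float triple_dot4_sse(const float* x, const float* y, const float* z)
{
    __m128 vx = _mm_load_ps(x);                       // 16-byte aligned load of 4 floats
    __m128 vy = _mm_load_ps(y);
    __m128 vz = _mm_load_ps(z);
    __m128 p  = _mm_mul_ps(_mm_mul_ps(vx, vy), vz);   // x(i)*y(i)*z(i) in 4 lanes at once

    // Horizontal sum of the 4 lanes using only SSE1 shuffles.
    __m128 shuf = _mm_shuffle_ps(p, p, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 sums = _mm_add_ps(p, shuf);                // [a+b, a+b, c+d, c+d]
    shuf = _mm_movehl_ps(shuf, sums);                 // bring c+d down to lane 0
    sums = _mm_add_ss(sums, shuf);                    // (a+b) + (c+d)
    return _mm_cvtss_f32(sums);
}

With 5 to 7 elements, the leftover 1 to 3 products would presumably be handled by scalar code, which is what makes me suspect the vectorization advantage mostly disappears for sums this short.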
Thank you.