Has anyone had experience using prefetch instructions for the Core 2 Duo processor?
I've been using the (standard?) prefetch set (prefetchnta, prefetcht0, prefetcht1, etc.) with success on a series of P4 machines, but when I run the same code on a Core 2 Duo the prefetcht(i) instructions seem to do nothing, and prefetchnta is less effective than it was on the P4.
My criterion for assessing performance is the timing of a BLAS 1 vector-vector (axpy) operation, with vectors large enough that the data no longer fits in cache.
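For concreteness, here is a minimal sketch of the kind of loop in question. It is not the actual benchmark code: the function name `axpy_prefetch`, the `PREFETCH_DISTANCE` value, and the 64-byte line size are illustrative assumptions. It uses the GCC/Clang builtin `__builtin_prefetch` rather than inline assembly; as I understand it, on x86 a locality argument of 0 is normally lowered to prefetchnta and 3 to prefetcht0, but that is worth checking in the generated assembly.

```c
#include <stddef.h>

/* Hypothetical tuning knob: how many elements ahead of the current
 * index to prefetch. The best value is machine dependent. */
#define PREFETCH_DISTANCE 64

/* Single-precision axpy: y[i] += alpha * x[i] for i in [0, n).
 * Prefetch hints are issued roughly once per cache line, assuming
 * 64-byte lines (16 floats). The hints never change the result. */
void axpy_prefetch(size_t n, float alpha, const float *x, float *y)
{
    for (size_t i = 0; i < n; ++i) {
        if ((i & 15) == 0 && i + PREFETCH_DISTANCE < n) {
            /* Second argument: 0 = read, 1 = write.
             * Third argument: 0 = non-temporal hint. */
            __builtin_prefetch(&x[i + PREFETCH_DISTANCE], 0, 0);
            __builtin_prefetch(&y[i + PREFETCH_DISTANCE], 1, 0);
        }
        y[i] += alpha * x[i];
    }
}
```

Calling `axpy_prefetch(n, alpha, x, y)` with `n` well beyond the last-level cache size, and timing it against the same loop without the two hint lines, is the comparison described above. Changing the third argument from 0 to 3 switches between the non-temporal and the temporal hint.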
Have Intel introduced new prefetch instructions?