Background: I've implemented a stochastic algorithm that requires random ordering for best convergence. Visiting the data in random order obviously destroys memory locality, however. I've found that prefetching the next iteration's data minimizes the performance drop.
I can prefetch n cache lines using _mm_prefetch in a simple, mostly OS- and compiler-portable fashion, but what's the length of a cache line? Right now I'm using a hardcoded value of 64, which seems to be the norm nowadays on x64 processors, but I don't know how to detect this at runtime, and a question about this last year found no simple solution.
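For concreteness, here is a minimal sketch of the hardcoded approach described above, assuming an x86/x64 target and a 64-byte line. The helper names lines_spanned and prefetch_region are hypothetical, made up for illustration; only _mm_prefetch and _MM_HINT_T0 come from the real intrinsics header.

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 (x86/x64 only) */

#define CACHE_LINE_LEN 64  /* assumption: hardcoded, not detected */

/* Hypothetical helper: number of cache lines touched by [p, p + len). */
static size_t lines_spanned(const void *p, size_t len)
{
    if (len == 0)
        return 0;
    uintptr_t first = (uintptr_t)p / CACHE_LINE_LEN;
    uintptr_t last  = ((uintptr_t)p + len - 1) / CACHE_LINE_LEN;
    return (size_t)(last - first + 1);
}

/* Hypothetical helper: issue one prefetch per cache line of the region. */
static void prefetch_region(const void *p, size_t len)
{
    /* Round the start address down to a line boundary. */
    uintptr_t base = (uintptr_t)p - (uintptr_t)p % CACHE_LINE_LEN;
    size_t n = lines_spanned(p, len);

    for (size_t i = 0; i < n; ++i)
        _mm_prefetch((const char *)(base + i * CACHE_LINE_LEN), _MM_HINT_T0);
}
```

If the real line is longer than 64 bytes, this only issues some redundant prefetches; if it is shorter, some lines are skipped, which is the case the hardcoded constant cannot handle.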
I've seen GetLogicalProcessorInformation on Windows, but I'm leery of using such a complex API for something so simple, and it won't work on Macs or Linux anyhow.
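For the non-Windows side, a rough sketch of what runtime detection might look like, assuming glibc's _SC_LEVEL1_DCACHE_LINESIZE extension to sysconf on Linux and the hw.cachelinesize sysctl on macOS. The function name cache_line_size is hypothetical, and the fallback of 64 is the same guess as the hardcoded constant. Neither query is guaranteed to be available or to return a useful value (sysconf can report 0 or -1 in some environments), hence the fallback.

```c
#include <stddef.h>
#include <unistd.h>       /* sysconf (POSIX) */
#if defined(__APPLE__)
#include <sys/types.h>
#include <sys/sysctl.h>   /* sysctlbyname */
#endif

/* Hypothetical helper: best-effort L1 data cache line size in bytes. */
static size_t cache_line_size(void)
{
#if defined(__APPLE__)
    size_t line = 0;
    size_t size = sizeof line;
    if (sysctlbyname("hw.cachelinesize", &line, &size, NULL, 0) == 0 && line > 0)
        return line;
#elif defined(_SC_LEVEL1_DCACHE_LINESIZE)
    /* glibc extension; not available in every libc. */
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (line > 0)
        return (size_t)line;
#endif
    return 64;  /* assumption: fall back to the common x64 value */
}
```

This does not cover Windows, so it is not a full answer to the portability problem, only an illustration of how much platform-specific code the detection needs.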
Perhaps there's some entirely different API or intrinsic that prefetches a memory region specified in bytes (or words, or whatever), so that I can prefetch without knowing the cache line length?
Basically, is there a reasonable alternative to _mm_prefetch with #define CACHE_LINE_LEN 64?