I am writing a multi-threaded program in C in which one core periodically takes an item from the head of a linked list while other cores append items to the tail (thread safety comes from some CAS magic that someone else provided for me). It appears that my program would run faster if the core taking an item from the head of the list also initiated a prefetch for the next item, which is almost certainly sitting in another core's cache.
Currently I am targeting an AMD Opteron 6168, compiling with gcc on Debian Linux. I've tried to find documentation for this, but I am in unfamiliar waters. All I can find is that -O3 enables compiler-inserted prefetching (for loops, I think), plus some mentions of AMD prefetch instruction names such as PREFETCHW.
I do not know where to find a reference for what I'm after, or how to insert such a statement into C. Maybe as a block of inline assembly?
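
To make the question concrete, here is a minimal sketch of what I have in mind, assuming GCC's `__builtin_prefetch(addr, rw, locality)` builtin is the right tool for this. The `struct node` layout and the `pop_head` function are hypothetical stand-ins for my real list, and the CAS handling for the appending cores is left out entirely. I have not checked which instruction gcc actually emits for this on the Opteron.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node layout; my real list carries a different payload. */
struct node {
    struct node *next;
    int value;
};

/* Single-consumer pop: detach the head node and start a prefetch for the
 * node that becomes the new head, so it is (hopefully) already on its way
 * into this core's cache by the next call. Returns NULL on an empty list. */
static struct node *pop_head(struct node **head)
{
    assert(head != NULL);

    struct node *item = *head;
    if (item == NULL)
        return NULL;

    struct node *next = item->next;
    if (next != NULL) {
        /* Second argument: 0 = prefetch for reading, 1 = for writing.
         * Third argument: temporal locality hint, 0 (none) to 3 (high).
         * This is only a hint; it does not change program behavior. */
        __builtin_prefetch(next, 0, 3);
    }

    *head = next;
    return item;
}
```

The call site would just be `struct node *item = pop_head(&head);` on the consuming core. Is this the right approach, and would passing 1 as the second argument be the way to get something like PREFETCHW?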