The theoretical maximum memory bandwidth for a Core 2 processor with dual-channel DDR3 memory is impressive: according to the Wikipedia article on the architecture, 10+ or 20+ gigabytes per second. However, stock memcpy() calls do not attain this. (3 GB/s is the highest I've seen on such systems.) Most likely this is because the OS vendor cannot tune memcpy() for each processor line's characteristics, so a stock memcpy() implementation has to perform reasonably across a wide range of brands and lines.
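For reference, here is a minimal sketch of how a throughput figure like the one above could be measured. It is not necessarily how the 3 GB/s number was obtained. The helper name `memcpy_gb_per_s` is made up for illustration, and the sketch assumes a C11 library that provides `timespec_get()`. Since that is a wall-clock source, the result should be treated as approximate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper (not from any library): copies `bytes` bytes `reps`
 * times with the stock memcpy() and returns the copy throughput in GB/s
 * (1 GB = 1e9 bytes), or -1.0 on failure. */
static double memcpy_gb_per_s(size_t bytes, int reps)
{
    /* Calling through a volatile function pointer keeps the compiler
     * from inlining or eliding the copies. */
    void *(*volatile copy)(void *, const void *, size_t) = memcpy;
    struct timespec t0, t1;
    double seconds;
    char *src, *dst;
    int i;

    if (bytes == 0 || reps <= 0)
        return -1.0;

    src = malloc(bytes);
    dst = malloc(bytes);
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1.0;
    }

    /* Touch every page first so page faults are not part of the timing. */
    memset(src, 0xA5, bytes);
    memset(dst, 0, bytes);

    timespec_get(&t0, TIME_UTC);   /* C11; wall clock, assumed available */
    for (i = 0; i < reps; i++)
        copy(dst, src, bytes);
    timespec_get(&t1, TIME_UTC);

    /* Sanity check: the destination must now match the source. */
    assert(memcmp(dst, src, bytes) == 0);
    free(src);
    free(dst);

    seconds = (double)(t1.tv_sec - t0.tv_sec)
            + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    if (seconds <= 0.0)
        return -1.0;               /* clock too coarse or stepped backwards */

    return (double)bytes * reps / seconds / 1e9;
}
```

Calling `memcpy_gb_per_s((size_t)256 << 20, 10)` copies 256 MiB ten times. The buffers need to be much larger than the last-level cache for the result to reflect memory bandwidth rather than cache bandwidth. Also note that each copied byte is both read and written, so the actual memory traffic is roughly twice the reported copy rate.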
My question: Is there a freely available, highly tuned memcpy() implementation for Core 2 or Core i7 processors that can be used in a C program? I'm sure I'm not the only person who needs one, and it would be a big waste of effort for everyone to micro-optimize their own memcpy().