I am interested in forcing a CPU cache flush in Windows (for benchmarking reasons, I want to emulate starting with no data in cpu cache), preferably a basic C implementation or win32 call. Is there a known way to do this with a system call or even something as sneaky as doing say a large memcopy? Intel i686 platform (P4 and up is okay as well).
Edit: I'm assuming you want to measure the performance difference between cache hits and cache misses? I would suggest running an operation 1,000 times or more, once on a data set that easily fits in the cache (maybe 1/100th of the total cache size or smaller), and once on one much larger than the cache (maybe 100 times its size). However, any performance test like this will be pretty flawed without an understanding of the cache strategy used, the cache line size, dozens of optimizations made by the engineers who designed the chip, etc. I advise running the test conditions I presented, calling it good, and keeping your sanity.
No. Modern commercial CPUs are way too abstract and complex to ever guarantee complete emptiness, especially with the operating system loaded. Even if you did "flush" it, the operating system would quickly fill it again with data used by background services, the benchmarking program, etc.
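There are x86 assembly instructions to force the CPU to flush certain cache lines (such as CLFLUSH), but they are pretty obscure. CLFLUSH in particular only flushes the one cache line containing the address you give it (from every cache level, not just L1), so you have to issue it for each line of the data you want evicted.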
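As a rough illustration of the CLFLUSH route, here is a minimal sketch using the `_mm_clflush` intrinsic from `<emmintrin.h>` (available in MSVC and GCC on SSE2-capable CPUs, i.e. P4 and up). The helper name `flush_range` and the 64-byte line size are my own assumptions, not something from the question, and it only evicts the buffer you pass in, not the whole cache.

```c
#include <emmintrin.h>  /* _mm_clflush, _mm_mfence (SSE2) */
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size. 64 bytes is typical on x86, but verify it
   for your CPU (CPUID reports the CLFLUSH line size). */
#define CACHE_LINE 64

/* Hypothetical helper: evict every cache line overlapping [p, p + len). */
static void flush_range(const void *p, size_t len)
{
    /* Round the start address down to a cache line boundary. */
    uintptr_t addr = (uintptr_t)p & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end  = (uintptr_t)p + len;

    for (; addr < end; addr += CACHE_LINE)
        _mm_clflush((const void *)addr);

    /* Fence so the flushes are ordered before the loads and stores
       of the code being timed. */
    _mm_mfence();
}
```

You would call `flush_range(data, size)` on your working set right before each timed run. Modified lines are written back to memory before being invalidated, so the data itself is preserved.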
something as sneaky as doing say a large memcopy?
Yes, this is the simplest approach. As long as the block is much larger than the largest cache, it should push your data out of all levels of cache. Just exclude the cache flushing time from your benchmarks and you should get a good idea of how your program performs under cache pressure.
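A minimal sketch of that idea, assuming 64 MiB is comfortably larger than the last-level cache on the machine under test (adjust `EVICT_BYTES` if not). The helper name `evict_caches` is hypothetical, and the `volatile` pointer is only there to stop the compiler from optimizing the loops away.

```c
#include <stddef.h>
#include <stdlib.h>

/* Assumption: 64 MiB exceeds the last-level cache; tune per CPU. */
#define EVICT_BYTES ((size_t)64 * 1024 * 1024)

/* Hypothetical helper: write and then read a large buffer so that
   previously cached benchmark data gets evicted.
   Returns 0 on success, -1 if the buffer could not be allocated. */
static int evict_caches(void)
{
    static unsigned char *buf = NULL;  /* allocated once, then reused */
    volatile unsigned char *p;
    unsigned char sink = 0;
    size_t i;

    if (buf == NULL) {
        buf = malloc(EVICT_BYTES);
        if (buf == NULL)
            return -1;
    }

    p = buf;
    for (i = 0; i < EVICT_BYTES; i++)
        p[i] = (unsigned char)i;       /* write pass */
    for (i = 0; i < EVICT_BYTES; i++)
        sink ^= p[i];                  /* read pass */

    (void)sink;
    return 0;
}
```

Call `evict_caches()` between iterations and keep it outside the timed region. This does not guarantee an empty cache (the eviction buffer itself ends up cached, and the replacement policy is not strictly LRU on every CPU), but the data you are benchmarking should be gone.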
There is unfortunately no way to explicitly flush the cache. A few of your options are:
1.) Thrash the cache by doing some very large memory operations between iterations of the code you're benchmarking.
2.) Set the Cache Disable (CD) bit in the x86 control register CR0 and benchmark that. This will probably disable the instruction cache also, which may not be what you want.
3.) Implement the portion of your code you're benchmarking (if it's possible) using non-temporal instructions. These are just hints to the processor about using the cache, though; it's still free to do what it wants.
1 is probably the easiest and sufficient for your purposes.
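For option 3, a non-temporal store looks roughly like this with the SSE2 intrinsics in `<emmintrin.h>`. This is only a sketch: `fill_nt` is a made-up helper, and whether the stores really bypass the cache is up to the processor, as noted above.

```c
#include <emmintrin.h>  /* _mm_stream_si32, _mm_sfence */
#include <stddef.h>

/* Hypothetical helper: fill dst[0..n) with value using non-temporal
   stores, which hint that the written data should not be cached. */
static void fill_nt(int *dst, size_t n, int value)
{
    size_t i;

    for (i = 0; i < n; i++)
        _mm_stream_si32(&dst[i], value);

    /* Non-temporal stores are weakly ordered; the fence makes them
       globally visible before any later stores. */
    _mm_sfence();
}
```

Only the stores are affected here; loads in the same routine still go through the cache as usual.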
Edit: Oops, I stand corrected. There is an instruction to invalidate the x86 cache; see drhirsch's answer.
Fortunately, there is more than one way to explicitly flush the caches.
The instruction "wbinvd" writes back modified cache content and marks the caches empty. It executes a bus cycle to make external caches flush their data. Unfortunately, it is a privileged instruction. But if it is possible to run the test program under something like DOS, this is the way to go. This has the advantage of keeping the cache footprint of the "OS" very small.
Additionally, there is the "invd" instruction, which invalidates caches without flushing them back to main memory. This violates the coherency of main memory and cache, so you have to take care of that by yourself. Not really recommended.
For benchmarking purposes, the simplest solution is probably copying a large memory block to a region marked with WC (write combining) instead of WB. The memory mapped region of the graphics card is a good candidate, or you can mark a region as WC by yourself via the MTRR registers.
You can find some resources about benchmarking short routines at Test programs for measuring clock cycles and performance monitoring.