views:

1295

answers:

5
+3  Q: 

CPU cache flush

I am interested in forcing a CPU cache flush in Windows (for benchmarking reasons, I want to emulate starting with no data in the CPU cache), preferably with a basic C implementation or Win32 call. Is there a known way to do this with a system call, or even something as sneaky as doing, say, a large memcopy? Intel i686 platform (P4 and up is okay as well).

A: 

Edit: I'm assuming you want to measure performance variances between cache hits and cache misses? I would suggest running 1,000 or more of an operation, once easily fitting in the cache (maybe 1/100th or smaller of the total cache size), and once much larger than the cache size (maybe 100 times). However, any kind of performance test like this will be pretty flawed without an understanding of the cache strategy used, the cache page size, dozens of optimizations made by the engineers who designed the chip, etc. I advise running the test conditions I presented, calling it good, and keeping your sanity.
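As a rough illustration of the small versus large working set comparison described above, here is a minimal sketch in C. The helper name `time_per_line` and the 64-byte line size are assumptions for illustration, not anything stated in the question, and `clock()` is coarse (it reports CPU time on POSIX systems and wall-clock time on Windows), so treat the numbers as indicative only.

```c
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

#define ASSUMED_LINE 64u  /* assumed cache line size in bytes */

/*
 * Hypothetical helper: reads one byte per assumed cache line across a
 * buffer of `bytes` bytes, `passes` times, and returns the average
 * seconds per read. Returns -1.0 on bad arguments or allocation failure.
 */
double time_per_line(size_t bytes, int passes)
{
    volatile unsigned char *buf;
    unsigned long sum = 0;
    size_t i, lines;
    clock_t start, stop;
    int p;

    if (bytes == 0 || passes <= 0)
        return -1.0;
    buf = calloc(bytes, 1);
    if (buf == NULL)
        return -1.0;

    lines = (bytes + ASSUMED_LINE - 1) / ASSUMED_LINE;

    start = clock();
    for (p = 0; p < passes; p++)
        for (i = 0; i < bytes; i += ASSUMED_LINE)
            sum += buf[i];            /* volatile read, not optimized away */
    stop = clock();

    (void)sum;
    free((void *)buf);
    return ((double)(stop - start) / CLOCKS_PER_SEC)
           / ((double)passes * (double)lines);
}
```

Calling it once with a buffer far smaller than the cache and many passes, and once with a buffer far larger than the cache, gives two per-read figures to compare. The first pass over each buffer also pays for page faults, so more passes give a steadier average.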

No. Modern commercial CPUs are way too abstract and complex to ever guarantee complete emptiness, especially with the operating system loaded. Even if you did "flush" it, the operating system would quickly fill it with data used by background services, the benchmarking program, etc.

marr75
Downvoted both for being technically incorrect and for obscuring the correct answers below it. It's true that caches are shared across processors (in increasingly complicated ways as the years progress), so it's not always possible for a single process to exert this kind of control over the memory environment. But in a test rig, as the question asks, it sounds doable to me. Certainly you **can** flush the cache.
Andy Ross
I think I'll do the memcpy of a large chunk of data. This is for testing some number crunching code performance. I will just loop once with the memcpy idea by itself in one test run and then on the next I'll include the number crunching and see what the time delta is. Thank you.
He suggested a 'test rig' using Windows as the operating system. It's not doable; every answer here and everything about the hardware abstraction of Windows makes it so. Additionally, from a computer engineering standpoint, the cache being shared across processes is the least complicated piece of this puzzle. More importantly, even in the raw test rig setup required (DOS or a lower-level operating system), the caching strategy, cache size, addresses, and cache page size will vary dramatically based on the hardware.
marr75
As an example of what I mean, a computer engineer might design a processor with a cache strategy that, instead of dropping the least recently used cache page, drops one at random. This will result in slightly more cache misses in many cases, but in some edge cases will result in far fewer. You can't assume to know what will be in the cache using portable C code, and you can't write the kind of test the OP suggests without a great deal of knowledge of the implementation details of the processor. You can very easily get some results that are good enough for most purposes from our suggestions.
marr75
The OP doesn't write a word about portability. Maybe he wants to measure HIS code running on HIS box? If you want to know something more about the "implementation details" of the cache and the memory, I suggest having a look at http://people.redhat.com/drepper/cpumemory.pdf
drhirsch
A: 

You can try the FlushInstructionCache function.

Adam Goode
While I'm not a Windows programmer, if that function holds true to its name, it only clears the instruction cache; the OP wants to clear the data cache.
Falaina
That's the instruction cache, and it only "Flushes the instruction cache for the specified process." It's also for managed C++, and it just marks that the code should be reloaded, no cleaning is done.
marr75
Indeed, it only seems to flush the L1 instruction cache. Presumably, program code cached in L2 cache, or data in L1 data cache, remain unaffected.
intgr
It's not for managed C++.
Adam Goode
Clarification: I meant managed by the operating system, my apologies. It really just marks some code pages for the process as out of date, so it still won't do what the OP is looking for.
marr75
Yeah. I accept my downvote.
Adam Goode
A: 

There are x86 assembly instructions that force the CPU to flush specific cache lines (such as CLFLUSH), but they are pretty obscure. CLFLUSH in particular flushes only the cache line containing a chosen address, although it evicts that line from every level of the cache hierarchy, not just L1.

something as sneaky as doing say a large memcopy?

Yes, this is the simplest approach, and will make sure that the CPU flushes all levels of cache. Just exclude the cache flushing time from your benchmarks and you should get a good idea of how your program performs under cache pressure.
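A minimal sketch of this approach in portable C follows. The helper name `thrash_cache` and the 64 MiB default size are assumptions; pick a size several times larger than your CPU's last-level cache. Eviction this way is very likely but not strictly guaranteed, since it depends on the replacement policy.

```c
#include <stddef.h>
#include <stdlib.h>

/* Assumed to be well above the last-level cache size; adjust for your CPU. */
#define THRASH_BYTES ((size_t)64 * 1024 * 1024)

/*
 * Hypothetical helper: writes and then reads every byte of a freshly
 * allocated buffer of `bytes` bytes, pushing previously cached data out.
 * Returns a checksum of the bytes read (0 if the allocation fails).
 */
unsigned long thrash_cache(size_t bytes)
{
    volatile unsigned char *buf = malloc(bytes);
    unsigned long sum = 0;
    size_t i;

    if (buf == NULL)
        return 0;

    for (i = 0; i < bytes; i++)
        buf[i] = (unsigned char)(i & 0xFF);   /* write pass */
    for (i = 0; i < bytes; i++)
        sum += buf[i];                        /* read pass */

    free((void *)buf);
    return sum;
}
```

Call `thrash_cache(THRASH_BYTES)` before starting the timer for each iteration of the code under test, so the flushing time stays outside the measured interval.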

intgr
"will make sure that the CPU flushes all levels of cache." Not true. As I stated, modern commercial CPUs, especially when abstracted by an operating system, can (and probably do) have very complicated caching strategies.
marr75
I believe you are confusing the CPU cache with other OS-level caches. The OS has basically no say in what the CPU will cache or not cache, because these decisions need to happen so quickly, there is no time for kernel interrupts or anything of the like. CPU cache is implemented purely in silicon.
intgr
A context switch will indeed let other processes run and thereby pollute the cache. But this is a normal part of OS behavior -- it will take place with or without the benchmark, so it makes sense to include this in your timings anyway.
intgr
A: 

There is unfortunately no way to explicitly flush the cache. A few of your options are:

1.) Thrash the cache by doing some very large memory operations between iterations of the code you're benchmarking.

2.) Set the Cache Disable (CD) bit in the x86 control register CR0 and benchmark that. This will probably disable the instruction cache as well, which may not be what you want.

3.) Implement the portion of your code you're benchmarking (if it's possible) using non-temporal instructions. These are just hints to the processor about using the cache, though; it's still free to do what it wants.

1 is probably the easiest and sufficient for your purposes.

Edit: Oops, I stand corrected. There is an instruction to invalidate the x86 cache; see drhirsch's answer.

Falaina
Your claim that there is no instruction for cache flushing is wrong. And rewriting a routine using non-temporal instructions for benchmarking is nonsense. If the data the routine is using fits in the caches, it would run way slower during the benchmarking, making the measurements worthless.
drhirsch
There is no way to explicitly flush the cache from Windows. You are denied direct access to the hardware... there are non-portable assembly instructions that can do it.
marr75
You can easily do it in Windows 95, 98, and ME. And even for the modern Windows variants you can implement it in ring 0 using a driver.
drhirsch
@drhirsch While I do stand corrected on the instruction for flushing the cache (thanks!), I disagree with your assessment of the use of non-temporal instructions. If he did the initial data loads for his benchmark using non-temporal instructions it isn't that much different from running with an empty cache and would be a sufficient way to simulate cold cache misses (though, I admit not nearly as correct as using the flush instruction!)
Falaina
I apologize, I was a bit harsh. But you can't modify a program using non-temporal instructions to simulate cold cache behavior for benchmarking. 1) You would need to unroll exactly one loop and make it non-temporal, thus changing the control flow and the usage of the instruction cache. 2) If the data resides in cache before the start, even non-temporal instructions will load the data from the cache, and you will get a warm cache result. 3) If not, the second iteration will need to fetch the data from memory again, and you will get a result with doubled memory latencies.
drhirsch
+4  A: 

Fortunately, there is more than one way to explicitly flush the caches.

The instruction "wbinvd" writes back modified cache content and marks the caches empty. It executes a bus cycle to make external caches flush their data. Unfortunately, it is a privileged instruction. But if it is possible to run the test program under something like DOS, this is the way to go. This has the advantage of keeping the cache footprint of the "OS" very small.

Additionally, there is the "invd" instruction, which invalidates caches without flushing them back to main memory. This violates the coherency of main memory and cache, so you have to take care of that by yourself. Not really recommended.

For benchmarking purposes, the simplest solution is probably copying a large memory block to a region marked with WC (write combining) instead of WB. The memory mapped region of the graphics card is a good candidate, or you can mark a region as WC by yourself via the MTRR registers.

You can find some resources about benchmarking short routines at Test programs for measuring clock cycles and performance monitoring.

drhirsch
Ohh, I stand corrected. Neat, I didn't know about this instruction.
Falaina