One thing you can do is call a function that has a lot of code and accesses a lot of memory in between calls to the item you are profiling. For example, in pseudocode (to keep it mostly language neutral):
// loop some number of times
{
    // start timing
    profile_func();
    // stop timing
    // add to total time
    large_func(); // uses lots of memory and has lots of code
}
// compute the time per call of profile_func by dividing the total time by the number of iterations
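In concrete terms, a minimal C sketch of the harness could look like this (profile_func() stands in for whatever you are measuring, large_func() is the filler discussed next, and POSIX clock_gettime() with CLOCK_MONOTONIC supplies the timestamps):

#include <stdio.h>
#include <time.h>

void profile_func(void);   /* the function under test; supply your own */
void large_func(void);     /* cache-clobbering filler; one sketch below */

/* Elapsed nanoseconds between two timestamps. */
static long long elapsed_ns(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1000000000LL + (b.tv_nsec - a.tv_nsec);
}

int main(void)
{
    const int iterations = 10000;
    long long total_ns = 0;
    struct timespec start, stop;

    for (int i = 0; i < iterations; i++) {
        clock_gettime(CLOCK_MONOTONIC, &start);
        profile_func();
        clock_gettime(CLOCK_MONOTONIC, &stop);
        total_ns += elapsed_ns(start, stop);

        large_func(); /* evict profile_func's code and data from the caches */
    }

    printf("profile_func: %lld ns/call\n", total_ns / iterations);
    return 0;
}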
The body of large_func() can be nonsense code, like some set of operations repeated over and over. The key is that the function, and its memory accesses, do not get optimized out when you compile, so that calling it actually evicts the contents of the CPU's code and data caches (and of the L2 and L3 caches, if present, as well).
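One possible way to write such a filler is sketched below: it walks a buffer chosen to be larger than the last-level cache and folds the bytes into a checksum that is stored through a volatile variable, which keeps the compiler from discarding the loop as dead code. The 32 MB size is an assumption; pick something comfortably bigger than your CPU's largest cache. (Flushing the instruction cache as well really requires executing a large amount of code, which is harder to sketch portably.)

#include <stddef.h>

#define FILLER_SIZE (32u * 1024u * 1024u) /* assumed larger than the last-level cache */

static unsigned char filler_buf[FILLER_SIZE];
static volatile unsigned long filler_sink; /* volatile store defeats dead-code elimination */

void large_func(void)
{
    unsigned long sum = 0;
    for (size_t i = 0; i < FILLER_SIZE; i++) {
        filler_buf[i] += (unsigned char)i; /* write as well as read, to dirty the lines */
        sum += filler_buf[i];
    }
    filler_sink = sum;
}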
This is an important test for many cases. The reason is that small, fast functions profiled in isolation can run very fast by taking advantage of the CPU caches, inlining, and enregistration. But in large applications these advantages are often absent, because of the context in which those fast functions are called.
As an example, profiling a function by running it for a million iterations in a tight loop might show that it executes in, say, 50 nanoseconds. Run it through the framework I showed above, and its running time can jump to microseconds, because it no longer has the entire processor, with its registers and caches, to itself.
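For reference, that hot-loop baseline is just the same timing code without the filler call; dropped into the main() of the earlier sketch, it could look like:

/* Hot-loop baseline: one timestamp pair around a tight loop, caches stay warm. */
clock_gettime(CLOCK_MONOTONIC, &start);
for (int i = 0; i < 1000000; i++)
    profile_func();
clock_gettime(CLOCK_MONOTONIC, &stop);
printf("hot loop: %lld ns/call\n", elapsed_ns(start, stop) / 1000000);

Comparing the two numbers shows how much of the function's apparent speed comes from having warm caches rather than from the function itself.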