I have long wondered which is more efficient with regard to making better use of CPU caches (which are known to benefit from locality of reference): two loops, each iterating over the same mathematical set of numbers but each with a different loop body, or one loop that "concatenates" the two bodies and thus accomplishes the identical total result all by itself?

In my opinion, having two loops would introduce fewer cache misses and evictions, because more of the instructions and data used by each loop would fit in the cache. Am I right?

Assuming:

  1. The cost of a single call to f or g is negligible compared to the cost of running the entire loop that contains it
  2. f and g each use most of the cache by themselves, so calling one after the other would keep invalidating the cache (which would be the case with the single-loop-body version)
  3. Intel Core Duo CPU
  4. C language source code
  5. gcc compiler, no switches

The set being iterated over is a mathematical set, not a container of numbers in memory like a vector or a list. See the example below.

Please no answers of the "premature optimization is evil" character :-)

An example of the two-loop version that I am advocating:

int j = 0, k = 0;

for(int i = 0; i < 1000000; i++)
{
    j += f(i);
}

for(int i = 0; i < 1000000; i++)
{
    k += g(i);
}
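
For comparison, the single-loop version this is weighed against (it appears again in an answer below) would be:

int j = 0, k = 0;

for(int i = 0; i < 1000000; i++)
{
    j += f(i);
    k += g(i);
}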
A: 

This seems like something the compiler could optimize for you, so instead of trying to figure it out yourself and hand-tune it, use whichever method makes your code clearer and more readable. If you really must know, time both methods with the input size and calculation type your application actually uses (try the code you have now, but repeat the calculations many, many times, and disable optimization).
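
As a rough sketch of what timing both variants could look like (the bodies of f and g below are made-up placeholders, and clock_gettime is POSIX; older glibc may need linking with -lrt):

#include <stdio.h>
#include <time.h>

/* Hypothetical placeholder bodies - substitute the real f and g. */
static int f(int i) { return i % 7; }
static int g(int i) { return i % 13; }

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    int j = 0, k = 0;

    double t0 = seconds();
    for (int i = 0; i < 1000000; i++) j += f(i);
    for (int i = 0; i < 1000000; i++) k += g(i);
    double t1 = seconds();

    j = 0; k = 0;
    for (int i = 0; i < 1000000; i++) { j += f(i); k += g(i); }
    double t2 = seconds();

    printf("two loops: %f s, one loop: %f s (j=%d, k=%d)\n",
           t1 - t0, t2 - t1, j, k);
    return 0;
}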

btfx
I have edited my question to point out that I would not like "premature optimization is evil" type answers, which I believe yours is. The question is not about programming style or whether to optimize; it's about a typical scenario of two loops vs. one on a given architecture with some defined parameters.
amn
Disabling optimizations generally isn't a good idea: you'll benchmark something completely different from what you would really get when you use the code. You should benchmark with the same optimizations you would use for the real program, or it won't reflect the actual running times.
sth
@sth : I meant if he wanted to see which method was computationally faster, he could disable optimization to get the same effect as counting clocks by hand.
btfx
+8  A: 

To measure is to know.

Jim Lewis
+5  A: 

Intuitively one loop is better: you increment i a million fewer times and all the other operation counts remain the same.

On the other hand, it completely depends on f and g. If each is large enough that its code, or the cacheable data it uses, nearly fills a critical cache (for example, a 32 KB L1 cache on a Core Duo core), then alternating between f and g may completely swamp any single-loop benefit.

As you say: it depends.

Charles Bailey
That's exactly why I was curious. I think that when `f` and `g` are complex enough that each needs most of the cache(s) for itself, calling both one after the other within one loop body will absolutely have a detrimental effect on performance. But that's my uneducated opinion, of course.
amn
+2  A: 

Your question is not clear enough to give a remotely accurate answer, but I think I understand where you are headed. The data you are iterating over is large enough that before you reach the end you will start to evict data, so that the second time over it (the second loop) some if not all of it will have to be read from memory again.

If the two loops were joined, so that each element/block is fetched for the first operation and is then already in cache for the second operation, then no matter how large the data is relative to the cache, most if not all of the second operations will take their data from the cache.

Various things (the nature of the cache, the loop code itself being evicted by data and then, when re-fetched, evicting data in turn) may cause some misses on the second operation. On a PC with an operating system, many evictions will occur as other programs get time slices. But assuming an ideal world, the first operation on index i of the data will fetch it from memory, and the second operation will grab it from cache.

Tuning for a cache is difficult at best. I regularly demonstrate this even on an embedded system: no interrupts, a single task, the same source code. Execution time/performance can vary dramatically simply by changing compiler optimization options, changing compilers, or switching brands or versions of compilers, e.g. gcc 2.x vs. 3.x vs. 4.x (gcc does not necessarily produce faster code with newer versions, by the way, and a compiler that is pretty good at a lot of targets is not really good at any one particular target). The same code with different compilers or options can change execution time by several times: 3 times faster, 10 times faster, etc.

Once you get into testing with or without a cache, it gets even more interesting. Add a single nop in your startup code so that your whole program moves one instruction over in memory, and your cache lines now hit in different places. Same compiler, same code. Repeat this with two nops, three nops, etc. Same compiler, same code, and you can see differences of tens of percent, better and worse (for the tests I ran that day on that target with that compiler).

That doesn't mean you can't tune for a cache; it just means that trying to figure out whether your tuning is helping or hurting can be difficult. The normal answer is just "time it and see", but that doesn't work anymore. You might get great results on your computer that day with that program and that compiler, but tomorrow, on your computer or any day on someone else's computer, you may be making things slower, not faster. You need to understand why this or that change made it faster; maybe it had nothing to do with your code. Your email program may have been downloading a lot of mail in the background during one test and not during the other.
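
As an illustration of that nop experiment (entirely hypothetical; where the padding actually ends up depends on your toolchain and link order):

/* Compiled into the image so that everything placed after it shifts
   in memory; vary the nop count between builds and re-time the same,
   otherwise unchanged, program. */
void alignment_pad(void)
{
    __asm__ volatile ("nop"); /* one nop; try two, three, ... in later builds */
}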

Assuming I understood your question correctly I think the single loop is probably faster in general.

dwelch
I am not iterating over data; I am iterating over a series of numbers, a notion known as a mathematical set, to be precise. In layman programmer terms: `for(int i = 0; i < n; i++)`. But thanks for a lot of valuable knowledge in the rest of your answer. The question is left somewhat vague because I want the general picture of things. If there is one, of course.
amn
@amn Does the set live in memory or registers or where?
dwelch
@amn what is it that is in cache/memory that you are trying to optimize?
dwelch
Hello. Well, the question arose out of general curiosity. Occasionally, and not just when using C, I am in a similar situation with actual code. The set does not live anywhere, because the loop is `for(int i = 0; i < 1000000; i++)`, and so only `i` will most likely be in memory and the various caches at all times, depending on how smart the compiler is. I am trying to optimize the work done by each loop, but as both `f` and `g` are hypothetical, I admit it may be too vague for a definite answer.
amn
Depending on the loop and how much code is in it, `i` can live in a register and never go to memory (even if a location is reserved for it). If you are only looking at the loop variable, what kills you is the branch to the top of the loop flushing the pipe; it's like any branch, not as bad as a function call that has to set up its arguments, but it does cause a pipeline flush.
dwelch
If `i` were in memory (the stack if a local variable, main memory if global) then it would be cached; it might get bumped from the cache occasionally, but the next time through the loop it would be back in cache. Two loops would mean twice as many reads of `i` from the cache, which still costs you, so two loops are slower, with or without a cache, as far as the loop variable `i` goes.
dwelch
A: 

If I came across the two-loop version in code, with no explanatory comments, I would wonder why the programmer did it that way, and probably consider the technique to be of dubious quality, whereas a one-loop version would not be surprising, commented or not.

But if I came across the two-loop version along with a comment like "I'm using two loops because it runs X% faster in the cache on CPU Y", at least I'd no longer be puzzled by the code, although I'd still question whether it was true and applicable to other machines.

joe snyder
The quality of the code is irrelevant; I thought I made that pretty clear. I would also sneer at the funky two-loop version because it is too verbose for its own good, but I wanted an answer, as far as one can be given, not the age-old discussion of code clarity, optimization, and generality. As much as I appreciate the attention to my question and a good discussion, I am downvoting this.
amn
@amn: No, you only referenced "optimization", not quality. And your desire to claim "all else being equal" for two unequal pieces of code is questionable. Cherry-picking which factors may be considered when determining which code is "faster" is just an artificial puzzle that I think will more likely lead to bad habits than good programming.
joe snyder
+3  A: 

I can see three variables (even in a seemingly simple chunk of code):

  • What do f() and g() do? Can one of them invalidate all of the instruction cache lines (effectively pushing the other one out)? Can that happen in the L2 cache too (unlikely)? Then keeping only one of them in it might be beneficial. Note: the inverse does not imply "have a single loop", because:
  • Do f() and g() operate on large amounts of data, indexed by i? Then it would be nice to know whether they operate on the same set of data; again, you have to consider whether operating on two different sets hurts you via cache misses.
  • If f() and g() are indeed as primitive as you first state, both in code size and in running time and complexity, cache locality issues won't arise in little chunks of code like this; your biggest concern would be some other process getting scheduled with actual work to do, invalidating all the caches until it is your process's turn to run.

A final thought: given that processes like the above might be a rare occurrence in your system (and I'm using "rare" quite liberally), you could consider making both your functions inline and letting the compiler unroll the loop. That is because, for the instruction cache, faulting back to L2 is no big deal, and the probability that the single cache line containing i, j, and k would be invalidated in that loop doesn't look so horrible. However, if that's not the case, some more details would be useful.
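
A sketch of what that inlining suggestion could look like, assuming f and g are small enough to be defined as static inline functions visible at the call site (the bodies below are hypothetical):

static inline int f(int i) { return i * 2 + 1; } /* hypothetical body */
static inline int g(int i) { return i ^ 0x55; }  /* hypothetical body */

int main(void)
{
    int j = 0, k = 0;

    /* With the definitions visible, gcc can inline both calls and,
       with optimization enabled (e.g. -O2 -funroll-loops), unroll
       the loop. */
    for (int i = 0; i < 1000000; i++)
    {
        j += f(i);
        k += g(i);
    }

    return j + k; /* keep the results live so the loop isn't optimized away */
}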

Michael Foukarakis
Given how the question was indeed too vague, I think your answer is THE answer here. Thanks.
amn
+1  A: 

Breaking the loops into smaller chunks is a good idea. It can improve the cache-hit ratio quite a lot and make a lot of difference to performance.

From your example:

int j = 0, k = 0;

for(int i = 0; i < 1000000; i++)
{
    j += f(i);
}

for(int i = 0; i < 1000000; i++)
{
    k += g(i);
}

I would either fuse the two loops into one loop like this:

int j = 0, k = 0;

for(int i = 0; i < 1000000; i++)
{
    j += f(i);
    k += g(i);
}

Or, if this is not possible, do the optimization called loop tiling:

#define TILE_SIZE 1000 /* or whatever you like - pick a number that keeps   */
                       /* the working-set below your first level cache size */

int j = 0, k = 0;
int i = 0;
int elements = 1000000;

do {
  int n = i + TILE_SIZE;
  if (n > elements) n = elements;

  // perform a chunk of loop A
  for (int a = i; a < n; a++)
  {
    j += f(a);
  }

  // perform the same chunk of loop B
  for (int a = i; a < n; a++)
  {
    k += g(a);
  }

  i = n;
} while (i < elements);

The trick with loop tiling is that, if the loops share an access pattern, the second loop body has a chance to re-use the data that has already been read into the cache by the first loop body. This won't happen if you run loop A for all million iterations before starting loop B, because the cache is not large enough to hold all that data.

Breaking the loop into smaller chunks and executing them one after another helps a lot here. The trick is to limit the working set of memory to below the size of your first-level cache. I aim for half the size of the cache, so other threads that get executed in between don't mess up my cache so much.

Nils Pipenbrinck