I have long wondered which is more efficient with regard to making better use of CPU caches (which are known to benefit from locality of reference): two loops, each iterating over the same mathematical set of numbers but each with a different loop body, or one loop that "concatenates" the two bodies and thus accomplishes the identical total result all by itself?

In my opinion, having two loops would introduce fewer cache misses and evictions, because more of the instructions and data used by each loop would fit in the cache. Am I right?

Assuming:

  1. The cost of a single call to f or g is negligible compared to the cost of running the entire loop that contains it
  2. f and g each use most of the cache by themselves, so calling one after the other would keep invalidating the cache (which would be the case with the single-loop-body version)
  3. Intel Core Duo CPU
  4. C language source code
  5. gcc compiler, no switches

The set being iterated over is a mathematical set, not a container of numbers in memory like a vector or a list. See the example below.

Please no answers of the "premature optimization is evil" character :-)

An example of the two-loop version that I am advocating:

int j = 0, k = 0;

for(int i = 0; i < 1000000; i++)
{
    j += f(i);
}

for(int i = 0; i < 1000000; i++)
{
    k += g(i);
}
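
For comparison, the single-loop version this is weighed against (it appears again in an answer below) would be:

int j = 0, k = 0;

for(int i = 0; i < 1000000; i++)
{
    j += f(i);
    k += g(i);
}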
A: 

This seems like something the compiler could optimize for you, so instead of trying to figure it out yourself and hand-tune it, use whichever method makes your code clearer and more readable. If you really must know, time both methods with the input size and calculation type your application actually uses (try the code you have now, but repeat the calculations many, many times, and disable optimization).
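
As a rough sketch of what timing both variants could look like (the bodies of f and g below are made-up placeholders, and clock_gettime is POSIX; older glibc may need linking with -lrt):

#include <stdio.h>
#include <time.h>

/* Hypothetical placeholder bodies - substitute the real f and g. */
static int f(int i) { return i % 7; }
static int g(int i) { return i % 13; }

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    int j = 0, k = 0;

    double t0 = seconds();
    for (int i = 0; i < 1000000; i++) j += f(i);
    for (int i = 0; i < 1000000; i++) k += g(i);
    double t1 = seconds();

    j = 0; k = 0;
    for (int i = 0; i < 1000000; i++) { j += f(i); k += g(i); }
    double t2 = seconds();

    printf("two loops: %f s, one loop: %f s (j=%d, k=%d)\n",
           t1 - t0, t2 - t1, j, k);
    return 0;
}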

btfx
I have edited my question to point out that I would not like "premature optimization is evil" type answers, which I believe yours is. The question is not about programming style or whether to optimize; it's about a typical scenario of two loops vs. one on a given architecture with some defined parameters.
amn
Disabling optimizations generally isn't a good idea: you'll benchmark something completely different from what you would really get when you use the code. You should benchmark with the same optimizations you would use for the real program, or it won't reflect the actual running times.
sth
@sth : I meant if he wanted to see which method was computationally faster, he could disable optimization to get the same effect as counting clocks by hand.
btfx
+8  A: 

To measure is to know.

Jim Lewis
+5  A: 

Intuitively one loop is better: you increment i a million fewer times and all the other operation counts remain the same.

On the other hand, it completely depends on f and g. If each is large enough that its code, or the cacheable data it uses, nearly fills a critical cache (for example, a 32 KB L1 cache on a Core Duo core), then alternating between f and g may completely swamp any single-loop benefit.

As you say: it depends.

Charles Bailey
That's exactly why I was curious. I think that when `f` and `g` are complex enough that each needs most of the cache(s) for itself, calling both one after the other within one loop body will absolutely have a detrimental effect on performance. But that's my uneducated opinion, of course.
amn
+2  A: 

Your question is not clear enough to give a remotely accurate answer, but I think I understand where you are headed. The data you are iterating over is large enough that before you reach the end you will start to evict data, so that the second time over it (the second loop) some if not all of it will have to be read from memory again.

If the two loops were joined, so that each element/block is fetched for the first operation and is then already in cache for the second operation, then no matter how large the data is relative to the cache, most if not all of the second operations will take their data from the cache.

Various things (the nature of the cache, the loop code itself being evicted by data and then, when re-fetched, evicting data in turn) may cause some misses on the second operation. On a PC with an operating system, many evictions will occur as other programs get time slices. But assuming an ideal world, the first operation on index i of the data will fetch it from memory, and the second operation will grab it from cache.

Tuning for a cache is difficult at best. I regularly demonstrate this even on an embedded system: no interrupts, a single task, the same source code. Execution time/performance can vary dramatically simply by changing compiler optimization options, changing compilers, or switching brands or versions of compilers, e.g. gcc 2.x vs. 3.x vs. 4.x (gcc does not necessarily produce faster code with newer versions, by the way, and a compiler that is pretty good at a lot of targets is not really good at any one particular target). The same code with different compilers or options can change execution time by several times: 3 times faster, 10 times faster, etc.

Once you get into testing with or without a cache, it gets even more interesting. Add a single nop in your startup code so that your whole program moves one instruction over in memory, and your cache lines now hit in different places. Same compiler, same code. Repeat this with two nops, three nops, etc. Same compiler, same code, and you can see differences of tens of percent, better and worse (for the tests I ran that day on that target with that compiler).

That doesn't mean you can't tune for a cache; it just means that trying to figure out whether your tuning is helping or hurting can be difficult. The normal answer is just "time it and see", but that doesn't work anymore. You might get great results on your computer that day with that program and that compiler, but tomorrow, on your computer or any day on someone else's computer, you may be making things slower, not faster. You need to understand why this or that change made it faster; maybe it had nothing to do with your code. Your email program may have been downloading a lot of mail in the background during one test and not during the other.
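
As an illustration of that nop experiment (entirely hypothetical; where the padding actually ends up depends on your toolchain and link order):

/* Compiled into the image so that everything placed after it shifts
   in memory; vary the nop count between builds and re-time the same,
   otherwise unchanged, program. */
void alignment_pad(void)
{
    __asm__ volatile ("nop"); /* one nop; try two, three, ... in later builds */
}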

Assuming I understood your question correctly I think the single loop is probably faster in general.

dwelch
I am not iterating over data; I am iterating over a series of numbers, a notion known as a mathematical set, to be precise. In layman programmer terms: `for(int i = 0; i < n; i++)`. But thanks for a lot of valuable knowledge in the rest of your answer. The question is left somewhat vague because I want the general picture of things. If there is one, of course.
amn
@amn Does the set live in memory or registers or where?
dwelch
@amn what is it that is in cache/memory that you are trying to optimize?
dwelch
Hello. Well, the question arose out of general curiosity. Occasionally, and not just when using C, I am in a similar situation with actual code. The set does not live anywhere, because the loop is `for(int i = 0; i < 1000000; i++)`, and so only `i` will most likely be in memory and the various caches at all times, depending on how smart the compiler is. I am trying to optimize the work done by each loop, but as both `f` and `g` are hypothetical, I admit it may be too vague for a definite answer.
amn
Depending on the loop and how much code is in it, `i` can live in a register and never go to memory (even if a location is reserved for it). If you are only looking at the loop variable, what kills you is the branch to the top of the loop flushing the pipe; it's like any branch, not as bad as a function call that has to set up its arguments, but it does cause a pipeline flush.
dwelch
If `i` were in memory (the stack if a local variable, main memory if global) then it would be cached; it might get bumped from the cache occasionally, but the next time through the loop it would be back in cache. Two loops would mean twice as many reads of `i` from the cache, which still costs you, so two loops are slower, with or without a cache, as far as the loop variable `i` goes.
dwelch
A: 

If I came across the two-loop version in code, with no explanatory comments, I would wonder why the programmer did it that way, and probably consider the technique to be of dubious quality, whereas a one-loop version would not be surprising, commented or not.

But if I came across the two-loop version along with a comment like "I'm using two loops because it runs X% faster in the cache on CPU Y", at least I'd no longer be puzzled by the code, although I'd still question whether it was true and applicable to other machines.

joe snyder
The quality of the code is irrelevant; I thought I made that pretty clear. I would also sneer at the funky two-loop version because it is too verbose for its own good, but I wanted an answer, as far as one can be given, not the age-old discussion of code clarity, optimization, and generality. As much as I appreciate the attention to my question and a good discussion, I am downvoting this.
amn
@amn: No, you only referenced "optimization", not quality. And your desire to claim "all else being equal" for two unequal pieces of code is questionable. Cherry-picking which factors may be considered when determining which code is "faster" is just an artificial puzzle that I think will more likely lead to bad habits than good programming.
joe snyder
+3  A: 

I can see three variables (even in a seemingly simple chunk of code):

  • What do f() and g() do? Can one of them invalidate all of the instruction cache lines (effectively pushing the other one out)? Can that happen in the L2 cache too (unlikely)? Then keeping only one of them in it might be beneficial. Note: the inverse does not imply "have a single loop", because:
  • Do f() and g() operate on large amounts of data, indexed by i? Then it would be nice to know whether they operate on the same set of data; again, you have to consider whether operating on two different sets hurts you via cache misses.
  • If f() and g() are indeed as primitive as you first state, both in code size and in running time and complexity, cache locality issues won't arise in little chunks of code like this; your biggest concern would be some other process getting scheduled with actual work to do, invalidating all the caches until it is your process's turn to run.

A final thought: given that processes like the above might be a rare occurrence in your system (and I'm using "rare" quite liberally), you could consider making both your functions inline and letting the compiler unroll the loop. That is because, for the instruction cache, faulting back to L2 is no big deal, and the probability that the single cache line containing i, j, and k would be invalidated in that loop doesn't look so horrible. However, if that's not the case, some more details would be useful.
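
A sketch of what that inlining suggestion could look like, assuming f and g are small enough to be defined as static inline functions visible at the call site (the bodies below are hypothetical):

static inline int f(int i) { return i * 2 + 1; } /* hypothetical body */
static inline int g(int i) { return i ^ 0x55; }  /* hypothetical body */

int main(void)
{
    int j = 0, k = 0;

    /* With the definitions visible, gcc can inline both calls and,
       with optimization enabled (e.g. -O2 -funroll-loops), unroll
       the loop. */
    for (int i = 0; i < 1000000; i++)
    {
        j += f(i);
        k += g(i);
    }

    return j + k; /* keep the results live so the loop isn't optimized away */
}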

Michael Foukarakis
Given how the question was indeed too vague, I think your answer is THE answer here. Thanks.
amn
+1  A: 

Breaking the loops into smaller chunks is a good idea. It can improve the cache-hit ratio quite a lot and make a lot of difference to performance.

From your example:

int j = 0, k = 0;

for(int i = 0; i < 1000000; i++)
{
    j += f(i);
}

for(int i = 0; i < 1000000; i++)
{
    k += g(i);
}

I would either fuse the two loops into one loop like this:

int j = 0, k = 0;

for(int i = 0; i < 1000000; i++)
{
    j += f(i);
    k += g(i);
}

Or, if this is not possible, do the optimization called loop tiling:

#define TILE_SIZE 1000 /* or whatever you like - pick a number that keeps   */
                       /* the working-set below your first level cache size */

int j = 0, k = 0;
int i = 0;
int elements = 1000000;

do {
  int n = i + TILE_SIZE;
  if (n > elements) n = elements;

  // perform a chunk of loop A
  for (int a = i; a < n; a++)
  {
    j += f(a);
  }

  // perform the same chunk of loop B
  for (int a = i; a < n; a++)
  {
    k += g(a);
  }

  i = n;
} while (i < elements);

The trick with loop tiling is that, if the loops share an access pattern, the second loop body has a chance to re-use the data that has already been read into the cache by the first loop body. This won't happen if you run loop A for all million iterations before starting loop B, because the cache is not large enough to hold all that data.

Breaking the loop into smaller chunks and executing them one after another helps a lot here. The trick is to limit the working set of memory to below the size of your first-level cache. I aim for half the size of the cache, so other threads that get executed in between don't mess up my cache so much.

Nils Pipenbrinck