As I am using for-loops on large multi-dim arrays, any saving on the for-loop mechanism itself is meaningful.

Accordingly, I am looking for any tips on how to reduce this overhead.

e.g.: counting down using an unsigned int instead of an int, and using != 0 as the stop condition instead of > 0, supposedly lets the CPU do less work (I heard this once; I'm not sure it is always true).
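For reference, the idiom described above might look something like this in C++ (just a sketch; a, n and process() are hypothetical placeholders, and whether the != 0 test actually saves anything depends on the compiler and CPU):

void process(double x);                    // hypothetical per-element operation

void walk_backwards(double *a, unsigned int n)
{
    // Count down and compare against zero; on many CPUs this maps to a
    // compact "decrement and branch if not zero" instruction pattern.
    for (unsigned int i = n; i != 0; --i) {
        process(a[i - 1]);                 // i runs n..1, so index with i - 1
    }
}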

+12  A: 

One important suggestion: move as much calculation to the outer loop as possible. Not all compilers can do that automatically. For example, instead of:

for row = 0 to 999
    for col = 0 to 999
        cell[row*1000+col] = row * 7 + col

use:

for row = 0 to 999
    x = row * 1000
    y = row * 7
    for col = 0 to 999
        cell[x+col] = y + col
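In C or C++, the same transformation might look like this (a sketch based on the pseudocode above; the flat 1000x1000 cell array is an assumption):

void fill(int *cell)                       // cell points to 1000 * 1000 ints
{
    for (int row = 0; row < 1000; ++row) {
        const int x = row * 1000;          // hoisted: base offset of this row
        const int y = row * 7;             // hoisted: row-dependent part of the value
        for (int col = 0; col < 1000; ++col) {
            cell[x + col] = y + col;       // inner loop now does only additions
        }
    }
}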
paxdiablo
Yes, that resonates with my advice: Make the inner loop fast. An example of this is Quicksort.
Peter G.
+1  A: 

As your loops will have O(n^d) complexity (d = number of dimensions), what really counts is what you put INTO the loop, not the loop mechanism itself. Shaving a few cycles off the loop framework while an inefficient algorithm inside the loop burns millions of cycles is just snake oil.

Thorsten79
I never found O notation useful except when comparing two algorithms that do the same thing. It makes sense to say Bubble sort is O(n^2) whereas Quicksort is O(n lg n). It never made sense to me to say something is O(n^2) without something similar to compare it to.
Tony BenBrahim
To be pedantic: the basic implementation of Quicksort has average case complexity of O(n log n), but still has worst case complexity of O(n^2).
Mitch Wheat
We're not talking about comparing algorithms, Thorsten79 just wanted to point out that a nested for loop is going to calculate on the order of n^d times, and the smallness of the interior code is more important than the loop structure.
Karl
+5  A: 

Loop-unrolling can be one way. That is:

for (i=0; i<N; i++) {
  a[i]=...;
}

transforms into:

for (i=0; i<N; i+=4) {
  a[i]=...;
  a[i+1]=...;
  a[i+2]=...;
  a[i+3]=...;
}

You will need special handling when N is not a multiple of 4 in the example above.
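That handling might look like this (a sketch; f() stands in for whatever computes each element and is not part of the original answer):

double f(int i);                           // hypothetical per-element computation

void compute(double *a, int N)
{
    int i;
    for (i = 0; i + 3 < N; i += 4) {       // main loop, unrolled by 4
        a[i]     = f(i);
        a[i + 1] = f(i + 1);
        a[i + 2] = f(i + 2);
        a[i + 3] = f(i + 3);
    }
    for (; i < N; ++i) {                   // remainder loop: at most 3 iterations
        a[i] = f(i);
    }
}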

norheim.se
What makes this more efficient? Especially in the case where N is not divisible by 4, and hence you're introducing extra if statement checks on top of the loop?
Matthew Scharley
If N is large, the relative overhead of those if statements is quite small. (They have to be kept outside the loop itself.) Also, the overhead introduced by the loop is in the example (almost) reduced to 1/4. Unrolling only makes sense when the operation carried out for each element is quick.
norheim.se
It makes a difference; however, most self-respecting compilers will do this already!
Tony BenBrahim
Last I tried measuring anything like this, the tightest loop was best. I think it's because it was better at using the instruction cache.
David Thornley
+6  A: 

Have you measured the overhead? Do you know how much time is spent processing the for loops vs. how much time is spent executing your application code? What is your goal?

Greg Hewgill
+4  A: 

This isn't a language-agnostic question; it depends heavily not only on the language but also on the compiler. I believe most compilers will compile these two equivalently:

for (int i = 0; i < 10; i++) { /* ... */ }

int i = 0;
while (i < 10) {
    // ...
    i++;
}

In most languages/compilers, the for loop is just syntactic sugar for the latter while loop. Foreach is another question again, and how it's implemented is highly dependent on the language/compiler, but it's generally less efficient than a plain for/while loop. How much less efficient is, again, language- and compiler-dependent.

Your best bet would probably be to run some benchmarks with several different variations on a theme and see what comes out on top.

Edit: To that end, the suggestions here will probably save you more time than worrying about the loop mechanism itself will.
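A minimal timing harness for such benchmarks might look like this (a sketch only; run_variant() is a stand-in for whichever loop variant you want to measure):

#include <chrono>
#include <cstdio>

volatile double sink = 0.0;                // keeps the work from being optimized away

void run_variant()                         // stand-in for the loop under test
{
    double s = 0.0;
    for (int i = 0; i < 1000000; ++i) {
        s += i * 0.5;
    }
    sink = s;
}

int main()
{
    using clock = std::chrono::steady_clock;

    const auto start = clock::now();
    for (int rep = 0; rep < 100; ++rep) {  // repeat to get a measurable duration
        run_variant();
    }
    const auto stop = clock::now();

    const auto us = std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
    std::printf("100 repetitions took %lld us\n", static_cast<long long>(us));
    return 0;
}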

Matthew Scharley
+3  A: 

I agree with @Greg. First thing you need to do is put some benchmarking in place. There will be little point optimising anything until you prove where all your processing time is being spent. "Premature optimisation is the root of all evil"!

Mitch Wheat
+8  A: 

Try to make your loops' memory accesses contiguous; this will optimize cache usage. That is, don't do this:

for (int i = 0; i < m; i++)
    for (int j = 0; j < n; j++)
        s += arr[j][i];
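The cache-friendly version simply swaps which index varies fastest, so consecutive iterations touch adjacent memory (a sketch, assuming arr is an ordinary row-major C/C++ array):

for (int i = 0; i < m; i++)
    for (int j = 0; j < n; j++)
        s += arr[i][j];   // inner loop walks consecutive addresses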
  • If processing images, convert two loops to one loop on the pixels with a single index.
  • Don't make loops that will run zero times, as the pipeline is optimized to assume a loop will continue rather than end.
Lev
+4  A: 

BTW, unless you need post-increment, you should always use the pre-increment operator. It is only a minor difference, but it is more efficient.

Internally this is the difference:

  • Post Increment

    i++;

    is the same as:

    int postincrement( int &i )
    {
        int itmp = i;
        i = i + 1;
        return itmp;
    }

  • Pre Increment

    ++i;

    is the same as:

    int preincrement( int &i )
    {
        i = i + 1;
        return i;
    }

RichS
I think you meant to write ++i;
WW
When you're incrementing an int, the compiler is very likely to optimize the difference away. This is more relevant when dealing with iterators.
shoosh
A: 

I think most compilers would probably do this anyway; stepping down to zero should be more efficient, as a check against zero is very fast for the processor. Again though, any compiler worth its weight would do this with most loops anyway. You need to look at what the compiler is doing.

Steve
A: 

There is not enough information to answer your question accurately. What are you doing inside your loops? Does the calculation in one iteration depend on a value calculated in a previous iteration? If not, you can almost cut your time in half simply by using 2 threads, assuming you have at least a dual-core processor.
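If the iterations really are independent, that split might be sketched like this (illustrative only; parallel_sum(), the vector of doubles, and the half-and-half division are assumptions, not from the original post):

#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Sum one half-open range [begin, end) of the data into *out.
static void sum_range(const std::vector<double> &data,
                      std::size_t begin, std::size_t end, double *out)
{
    double s = 0.0;
    for (std::size_t i = begin; i < end; ++i) {
        s += data[i];
    }
    *out = s;
}

double parallel_sum(const std::vector<double> &data)
{
    const std::size_t mid = data.size() / 2;
    double lo = 0.0, hi = 0.0;

    // Each thread works on an independent half, so no locking is needed.
    std::thread t1(sum_range, std::cref(data), std::size_t(0), mid, &lo);
    std::thread t2(sum_range, std::cref(data), mid, data.size(), &hi);
    t1.join();
    t2.join();

    return lo + hi;
}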

Another thing to look at is how you are accessing your data. If you are doing large array processing, make sure you access the data sequentially, in the order it is stored in memory, to avoid flushing your L1/L2 cache on every iteration (I have seen this before with smaller L1 caches; the difference can be dramatic).

Again, I would look at what is inside the loop first, where most of the gains (>99%) will be, rather than the outer loop plumbing.

But then again, if your loop code is I/O bound, then any time spent on optimization is wasted.

Tony BenBrahim
+3  A: 

First, don't sweat the small stuff. Details like counting up versus counting down are usually completely irrelevant in running time. Humans are notoriously bad at spotting areas in code that need to be sped up. Use a profiler. Pay little or no attention to any part of the loop that is not repeated, unless the profiler says otherwise. Remember that what is written in an inner loop is not necessarily executed in an inner loop, as modern compilers are pretty smart about avoiding unnecessary repetition.

That being said, be very wary of unrolling loops on modern CPUs. The tighter they are, the better they will fit into cache. In a high-performance application I worked on last year, I improved performance significantly by using loops instead of straight-line code, and tightening them up as much as I could. (Yes, I profiled; the function in question took up 80% of the run time. I also benchmarked times over typical input, so I knew the changes helped.)

Moreover, there's no harm in developing habits that favor efficient code. In C++, you should get in the habit of using pre-increment (++i) rather than post-increment (i++) to increment loop variables. It usually doesn't matter, but it can make a significant difference; it doesn't make code less readable or writable, and it won't hurt.

David Thornley
A: 

There is some relevant information among the answers to another stackoverflow question, how cache memory works. I found the paper by Ulrich Drepper referred to in this answer especially useful.

Jacob Gabrielson
A: 

By the way, is it good to use short instead of int in a for-loop if Int16 capacity is guaranteed to be enough?

abatishchev
On most modern computers, 32-bit operations are at least as fast as 16-bit ones. So the answer is no, it will not matter.
DasBoot