views:

486

answers:

6

Hi all,

On a mailing list I'm subscribed to, two fairly knowledgeable (IMO) programmers were discussing some optimized code, and saying something along the lines of:

"On the CPUs released 5-8 years ago, it was slightly faster to iterate for loops backwards (e.g. for (int i=x-1; i>=0; i--) {...}) because comparing i to zero is more efficient than comparing it to some other number. But with very recent CPUs (e.g. from 2008-2009) the speculative loader logic is such that it works better if the for loop is iterated forward (e.g. for (int i=0; i< x; i++) {...})"

My question is, is that true? Have CPU implementations changed recently such that forward-loop-iterating now has an advantage over backward-iterating? If so, what is the explanation for that? i.e. what changed?

(Yes, I know, premature optimization is the root of all evil, review my algorithm before worrying about micro-optimizations, etc etc.... mostly I'm just curious)

+2  A: 

No, we can't say that CPU implementations have changed to make forward looping faster. And that has very little to do with the CPUs themselves.

It has to do with the fact that you haven't specified which CPU you're talking about, nor which compiler.

You cannot ask a blanket question about CPU issues with the C tag and expect to get an intelligent answer simply because nothing in the C standard mandates how fast CPU s should be at various operations.

If you'd like to rephrase your question to target a specific CPU and machine language (since what machine language you get out of a C compiler depends entirely on the compiler), you may get a better answer.

In either case, it shouldn't matter. You should be relying on the fact that the people that wrote your compiler know a great deal more than you about how to eke the last inch of performance from the various CPUs.

The direction in which you should be iterating has always been dictated by what you have to do. For example, if you have to process array elements in ascending order, you use:

for (i = 0; i < 1000; i++) { process (a[i]); }

rather than:

for (i = 999; i >= 0; i--) { process (a[999-i]); }

simply because any advantage you may gain in going backwards is more than swamped by the extra calculations on i. It may well be that a naked loop (no work done in the body) may be faster in one direction than another but, if you have such a naked loop, it's not doing any real work anyway.

As an aside, it may well be that both those loops above will come down to the same machine code anyway. I've seen some of the code put out by the GCC optimizer and it made my head spin. Compiler writers are, in my opinion, a species alone when it comes to insane levels of optimization.

My advice: always program for readability first then target any specific performance problems you have ("get it working first, then get it working fast").

paxdiablo
It's really annoying that nobody ever answers performance questions on here. People ask a question that might have an interesting answer, then all the parrots come out and say "get it working first, then get it working fast". Yes. That's a great rule of thumb, but what if someone (god forbid) actually got to the "get it working fast" part? They'll never find an answer on SO.
tgamblin
To add to that, there are PLENTY of loops in high performance computing where traversal order matters a lot. It's not just forward or back, but what memory you touch in the loop and how it hits your cache and how well the prefetcher can predict it that will determine performance. Look at z-order traversals, which were MADE to preserve locality in situations like this.
tgamblin
@tgamblin, there *is* no answer based on the available data simply because there are holes in the question. You *don't* know what the compiler will output, you *don't* know what the CPU is, and so on. And I've answered plenty of the "people at the get it working fast" stage here on SO. The answer there is to profile and target problem areas (all of which depends on the target environments).
paxdiablo
... If you want performance, you have two ways to go. Code in assembly for your target environment or trust in the compiler writers. I prefer the latter. In any case, you're free to answer the question differently if you're not happy with my answer.
paxdiablo
One can safely assume that he wants it to know for the most common consumer CPUs, which are Intel's and AMD's at the moment.
Georg
You may assume that, I prefer not to. And which compiler? VC, GCC, Code::Blocks? Which version of said compiler? My point is there are too many unknowns in the question to make a blanket statement. And even if all that *were* known, I still stand by my comment that the compiler writers will leave regular folk eating their dust in terms of optimization.
paxdiablo
I was hoping to learn something interesting about recent trends in speculative execution technology, not to get a lecture about the evils of premature optimization. That's why I included the final paragraph in my question -- to head off exactly this sort of unhelpful response.
Jeremy Friesner
Then you should not have asked about C code. There is a disconnect between C and the CPU which you haven't made a link for. You would have been better off asking about assembly and specifying the CPUs your interested in.
paxdiablo
@pax: That is a whole lot of BS. You don't need to talk about specific assembly to talk about performance. I work in high performance computing and the vast majority of optimizations people make to C code (and C++, and Fortran) don't involve touching assembly at all. They involve restructuring code (especially loops) to perform better with different memory systems, compilers, processors, etc. You CAN give general recommendations about performance, and you can talk about how C code will perform on different machines in general without mentioning specific ones.
tgamblin
Well, then we're going to have to agree to disagree on that one. @JeremyF, if you actually think this answer is unhelpful, let me know and I'll delete it. I'm trying to help but if it's unappreciated, I'd rather move on.
paxdiablo
A: 

When optimizing loops i'd rather look into loop unrolling (as it cuts down the number of comparisons vs. the exit value, and it may be optimized for parallel processing (mmx) depending on what goes on inside the loop).

diciu
+10  A: 

I don't know. But I do know how to write a quick benchmark with no guarantees of scientific validity (actually, one with rather strict guarantees of invalidity). It has interesting results:

#include <time.h>
#include <stdio.h>

int main(void)
{
    int i;
    int s;
    clock_t start_time, end_time;
    int centiseconds;

    start_time = clock();
    s = 1;
    for (i = 0; i < 1000000000; i++)
    {
        s = s + i;
    }
    end_time = clock();
    centiseconds = (end_time - start_time)*100 / CLOCKS_PER_SEC;
    printf("Answer is %d; Forward took %ld centiseconds\n", s, centiseconds);

    start_time = clock();
    s = 1;
    for (i = 999999999; i >= 0; i--)
    {
        s = s + i;
    }
    end_time = clock();
    centiseconds = (end_time - start_time)*100 / CLOCKS_PER_SEC;
    printf("Answer is %d; Backward took %ld centiseconds\n", s, centiseconds);

    return 0;
}

Compiled with -O9 using gcc 3.4.4 on Cygwin, running on an "AMD Athlon(tm) 64 Processor 3500+" (2211 MHz) in 32 bit Windows XP:

Answer is -1243309311; Forward took 93 centiseconds
Answer is -1243309311; Backward took 92 centiseconds

(Answers varied by 1 either way in several repetitions.)

Compiled with -I9 using gcc 4.4.1 running on an "Intel(R) Atom(TM) CPU N270 @ 1.60GHz" (800 MHz and presumably only one core, given the program) in 32 bit Ubuntu Linux.

Answer is -1243309311; Forward took 196 centiseconds
Answer is -1243309311; Backward took 228 centiseconds

(Answers varied by 1 either way in several repetitions.)

Looking at the code, the forward loop is translated to:

; Gcc 3.4.4 on Cygwin for Athlon      ; Gcc 4.4.1 on Ubuntu for Atom
L5:                                .L2:
    addl    %eax, %ebx                 addl    %eax, %ebx
    incl    %eax                       addl    $1, %eax
    cmpl    $999999999, %eax           cmpl    $1000000000, %eax
    jle     L5                         jne     .L2

The backward to:

L9:                                .L3:
    addl    %eax, %ebx                 addl    %eax, %ebx
    decl    %eax                       subl    $1, $eax
    jns     L9                         cmpl    $-1, %eax
                                       jne .L3

Which shows, if not much else, that GCC's behaviour has changed between those two versions!

Pasting the older GCC's loops into the newer GCC's asm file gives results of:

Answer is -1243309311; Forward took 194 centiseconds
Answer is -1243309311; Backward took 133 centiseconds

Summary: on the >5 year old Athlon, the loops generated by GCC 3.4.4 are the same speed. On the newish (<1 year?) Atom, the backward loop is significantly faster. GCC 4.4.1 has a slight regression for this particular case which I personally am not bothered about in the least, given the point of it. (I had to make sure that s is used after the loop, because otherwise the compiler would elide the computation altogether.)

[1] I can never remember the command for system info...

Edmund
You can get a decent amount of CPU info with `cat /proc/cpuinfo`
tgamblin
@tgamblin: thanks! I thought there was something in /bin too, but this one is enough. It even works in Cygwin which is a pleasant surprise.
Edmund
Try just running pure repetition; does the compiler optimize it to a simple `loop` instruction?
Electro
@Electro: if you don't use the control variable (e.g. print it out or something), some compilers will just remove empty loops altogether.
tgamblin
+2  A: 

Yes. but with a caveat. The idea that looping backwards is faster never applied to all older CPUs. It's an x86 thing (as in 8086 through 486, possibly Pentium, although I don't think any further).

That optimization never applied to any other CPU architecture that I know of.

Here's why.

The 8086 had a register that was specifically optimized for use as a loop counter. You put your loop count in CX, and then there are several instructions that decrement CX and then set condition codes if it goes to zero. In fact there was an instruction prefix you could put before other instructions (the REP prefix) that would basically iterate the other instruction until CX got to 0.

Back in the days when we counted instructions and instructions had known fixed cycle counts using cx as your loop counter was the way to go, and cx was optimized for counting down.

But that was a long time ago. Ever since the Pentium, those complex instructions have been slower overall than using more, and simpler instructions. (RISC baby!) The key thing we try to do these days is try to put some time between loading a register and using it because the pipelines can actually do multiple things per cycle as long as you don't try to use the same register for more than one thing at a time.

Nowdays the thing that kills performance isn't the comparison, it's the branching, and then only when the branch prediction predicts wrong.

John Knoeller
+7  A: 

You're really asking about prefetching, not about loop control logic.

In general, loop performance isn't going to be dictated by the control logic (i.e. the increment/decrement and the condition that gets checked every time through). The time it takes to do these things is inconsequential except in very tight loops. If you're interested in that, take a look at John Knoeller's answer for specifics on the 8086's counter register and why it might've been true in the old days that counting down was more efficient. As John says, branch prediction (and also speculation) can play a role in performance here, as can instruction prefetching.

Iteration order can affect performance significantly when it changes the order in which your loop touches memory. The order in which you request memory addresses can affect what is drawn into your cache and also what is evicted from your cache when there is no longer room to fetch new cache lines. Having to go to memory more often than needed is much more expensive than compares, increments, or decrements. On modern CPUs it can take thousands of cycles to get from the processor to memory, and your processor may have to idle for some or all of that time.

You're probably familiar with caches, so I won't go into all those details here. What you may not know is that modern processors employ a whole slew of prefetchers to try to predict what data you're going to need next at different levels of the memory hierarchy. Once they predict, they try to pull that data from memory or lower level caches so that you have what you need when you get around to processing it. Depending on how well they grab what you need next, your performance may or may not improve when using them.

Take a look at Intel's guide to optimizing for hardware prefetchers. There are four prefetchers listed; two for NetBurst chips:

  1. NetBurst's hardware prefetcher can detect streams of memory accesses in either forward or backward directions, and it will try to load data from those locations into the L2 cache.
  2. NetBurst also has an adjacent cache line (ACL) prefetcher, which will automatically load two adjacent cache lines when you fetch the first one.

and two for Core:

  1. Core has a slightly more sophisticated hardware prefetcher; it can detect strided access in addition to streams of contiguous references, so it'll do better if you step through an array every other element, every 4th, etc.
  2. Core also has an ACL prefetcher like NetBurst.

If you're iterating through an array forward, you're going to generate a bunch of sequential, usually contiguous memory references. The ACL prefetchers are going to do much better for forward loops (because you'll end up using those subsequent cache lines) than for backward loops, but you may do ok making memory references backward if the prefetchers can detect this (as with the hardware prefetchers). The hardware prefetchers on the Core can detect strides, which is helpful for for more sophisticated array traversals.

These simple heuristics can get you into trouble in some cases. For example, Intel actually recommends that you turn off adjacent cache line prefetching for servers, because they tend to make more random memory references than desktop user machines. The probability of not using an adjacent cache line is higher on a server, so fetching data you're not actually going to use ends up polluting your cache (filling it with unwanted data), and performance suffers. For more on addressing this kind of problem, take a look at this paper from Supercomputing 2009 on using machine learning to tune prefetchers in large data centers. Some guys at Google are on that paper; performance is something that is of great concern to them.

Simple heuristics aren't going to help you with more sophisticated algorithms, and you might have to start thinking about the sizes of your L1, L2, etc. caches. Image processing, for example, often requires that you perform some operation on subsections of a 2D image, but the order you traverse the image can affect how well useful pieces of it stay in your cache without being evicted. Take a look at Z-order traversals and loop tiling if you're interested in this sort of thing. It's a pretty basic example of mapping the 2D locality of image data to the 1D locality of memory to improve performance. It's also an area where compilers aren't always able to restructure your code in the best way, but manually restructuring your C code can improve cache performance drastically.

I hope this gives you an idea of how iteration order affects memory performance. It does depend on the particular architecture, but the ideas are general. You should be able to understand prefetching on AMD and Power if you can understand it on Intel, and you don't really have to know assembly to structure your code to take advantage of memory. You just need to know a little computer architecture.

tgamblin
A: 

It probably doesn't make a hoot of difference speed-wise, but I often write:

for (i = n; --i >= 0; ) blah blah

which I think at one time generated cleaner assembly.

Of course, in answering this kind of question, I run the risk of affirming that this is important. It's a micro-optimization kind of question, which is closely related to premature optimization, which everybody says you shouldn't do, but nevertheless SO is awash in it.

Mike Dunlavey