views: 483
answers: 11

I'm just calculating the sum of two arrays, with a slight modification in the code:

#include <stdio.h>

int main()
{
    int a[10000] = {0}; // initialize with something
    int b[10000] = {0}; // initialize with something

    int sumA = 0, sumB = 0;

    for (int i = 0; i < 10000; i++)
    {
        sumA += a[i];
        sumB += b[i];
    }
    printf("%d %d", sumA, sumB);
    return 0;
}

OR

#include <stdio.h>

int main()
{
    int a[10000] = {0}; // initialize with something
    int b[10000] = {0}; // initialize with something

    int sumA = 0, sumB = 0;

    for (int i = 0; i < 10000; i++)
    {
        sumA += a[i];
    }

    for (int i = 0; i < 10000; i++)
    {
        sumB += b[i];
    }
    printf("%d %d", sumA, sumB);
    return 0;
}

Which code will be faster?

+4  A: 

The first one will be faster. The program will not need to run the loop twice. It's not much work, but some cycles are lost on incrementing the loop variable and evaluating the loop condition.

Developer Art
Now, what I thought is: in the second case you are accessing memory linearly, and if the size is big enough those accesses will hit the cache; but in the first case, if (assume) both arrays are not in the cache, then it would be slow.
GG
+1  A: 

The first one will be faster because you loop over the 10000 elements only once.

Danny T.
The number of loops is not the only relevant quantity involved. You also need to take into account the number of operations inside each loop, which, if you do, essentially amounts to the same number of operations in both code snippets. The question then depends on the cache hit/miss ratio (and perhaps other factors I'm forgetting atm).
Michael Foukarakis
+24  A: 

There is only one way to know, and that is to test and measure. You need to work out where your bottleneck is (CPU, memory bandwidth, etc.).

The size of the data in your arrays (ints in your example) would affect the result, as it has an impact on the use of the processor cache. Often you will find example 2 is faster, which basically means your memory bandwidth is the limiting factor (example 2 accesses memory in a more efficient way).
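
As one way to do that, here's a minimal, portable timing sketch (mine, not from this answer; it uses std::chrono from C++11 instead of a platform timer, and a serious benchmark would repeat the loops many times):

#include <chrono>
#include <cstdio>

int main()
{
    // static: keeps the two 40 KB arrays off the stack; they are zero-initialised.
    static int a[10000], b[10000];
    int sumA = 0, sumB = 0;

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 10000; i++)
    {
        sumA += a[i];
        sumB += b[i];
    }
    auto stop = std::chrono::steady_clock::now();

    long long ns = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
    // Print the sums so the optimiser can't remove the loop entirely.
    std::printf("%d %d in %lld ns\n", sumA, sumB, ns);
}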

rikh
> "There is only one way to know, and that is to test and measure." You could have stopped there :)
sbk
+1  A: 

The first one will probably be faster. The memory access pattern will allow a (modern) CPU to manage its caches efficiently (prefetching), even while accessing two arrays.

It can be much faster if your CPU allows it and the arrays are aligned: use SSE instructions to process 4 ints at a time.
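
A minimal sketch of that idea (my illustration, not code from the answer; packed 32-bit integer adds need only SSE2, and the function name simdSum is hypothetical; it assumes 16-byte-aligned data with a count divisible by four):

#include <emmintrin.h>  // SSE2 integer intrinsics

// Sum 'count' ints four at a time. Assumes 'data' is 16-byte aligned
// and 'count' is a multiple of 4.
int simdSum(const int *data, int count)
{
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < count; i += 4)
        acc = _mm_add_epi32(acc, _mm_load_si128((const __m128i *)(data + i)));

    // Horizontal reduction of the four partial sums.
    int parts[4];
    _mm_storeu_si128((__m128i *)parts, acc);
    return parts[0] + parts[1] + parts[2] + parts[3];
}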

Eric Bainville
+6  A: 

In theory, due to cache optimizations, the second one should be faster.

Caches are optimized to fetch and keep chunks of data, so the first access pulls a big chunk of the first array into the cache. In the first code, accessing the second array may evict some of the first array's data, requiring extra memory accesses.

In practice, both approaches will take more or less the same time, with the first being a little better given the size of actual caches and the likelihood that no data is evicted at all.

Note: this sounds a lot like homework. In real life, for these sizes, the first option will be slightly faster, but that only applies to this concrete example; nested loops, bigger arrays or, especially, smaller cache sizes would have a significant impact on performance depending on the traversal order (see the sketch below).
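
To illustrate the traversal-order point (a hypothetical example of mine, not part of the answer): the two functions below perform the same 1,000,000 additions, but one walks memory contiguously while the other strides across it.

// ~4 MB array, zero-initialised; far too big for most caches of the era.
static int grid[1000][1000];

long long rowMajor()          // walks memory contiguously: cache-friendly
{
    long long sum = 0;
    for (int r = 0; r < 1000; r++)
        for (int c = 0; c < 1000; c++)
            sum += grid[r][c];
    return sum;
}

long long columnMajor()       // jumps 4000 bytes per access: far more cache misses
{
    long long sum = 0;
    for (int c = 0; c < 1000; c++)
        for (int r = 0; r < 1000; r++)
            sum += grid[r][c];
    return sum;
}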

Jorge Córdoba
In this example, though, you'd need a tiny amount of very simple cache to see any cache optimisation hit. In a more complex example (i.e. one where the loop won't fit in the instruction cache) you may well be correct.
Goz
No Jorge, it's nothing like homework. I found this somewhere in an example of refactoring, that's why I asked.
GG
+2  A: 

Also you probably meant a[i] instead of a[10000] inside the loop :-)

Tilka
(this kind of remark should be in a comment... which you don't have enough rep to make yet, of course...)
xtofl
To earn rep one should give useful answers, not answers that look like comments.
Kirill V. Lyadvinsky
+1  A: 

The C++ Standard says nothing about it; it is implementation dependent. It looks like you are trying to do premature optimization. It shouldn't bother you unless it is a bottleneck in your program. If it is, use a profiler to find out which one is faster on your particular platform.

Until then, I'd prefer the first variant because it looks more readable (or, better, std::accumulate).
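
For instance (a minimal sketch reusing the arrays a and b from the question; std::accumulate lives in the <numeric> header):

#include <numeric>  // std::accumulate

// Each call sums one array; the 0 initial value makes the accumulator an int.
int sumA = std::accumulate(a, a + 10000, 0);
int sumB = std::accumulate(b, b + 10000, 0);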

Kirill V. Lyadvinsky
A: 

If you meant a[i] instead of a[10000] (and likewise for b), and if your compiler performs loop distribution optimizations, the first one will compile to exactly the same code as the second. If not, the second will perform slightly better.

If a[10000] is intended, then both loops will perform exactly the same (with trivial cache and flow optimizations).

Food for thought for some answers that were voted up: how many additions are performed in each version of the code?

Michael Foukarakis
Regarding the "food for thought": it doesn't matter how many additions are performed, provided they're performed while the CPU would otherwise be doing nothing, waiting on a memory load. Basically, either one could be faster, and either way we could say "ah, yes, that's to be expected because X" ;-)
Steve Jessop
You can't do algorithm analysis like that. ;)
Michael Foukarakis
Watch me :-)
Steve Jessop
+8  A: 

Here's some code with timing, built using VS2005:

#include <windows.h>
#include <iostream>
using namespace std;

int main ()
{
  LARGE_INTEGER
    start,
    middle,
    end;

  const int
    count = 1000000;

  // Value-initialise the arrays so the reads below are well-defined.
  int
    *a = new int [count] (),
    *b = new int [count] (),
    *c = new int [count] (),
    *d = new int [count] (),
    suma = 0,
    sumb = 0,
    sumc = 0,
    sumd = 0;

  // Version 1: both arrays summed in a single loop.
  QueryPerformanceCounter (&start);
  for (int i = 0 ; i < count ; ++i)
  {
    suma += a [i];
    sumb += b [i];
  }
  QueryPerformanceCounter (&middle);

  // Version 2: each array summed in its own loop.
  for (int i = 0 ; i < count ; ++i)
  {
    sumc += c [i];
  }
  for (int i = 0 ; i < count ; ++i)
  {
    sumd += d [i];
  }
  QueryPerformanceCounter (&end);

  cout << "Time taken = " << (middle.QuadPart - start.QuadPart) << endl;
  cout << "Time taken = " << (end.QuadPart - middle.QuadPart) << endl;

  // Print the sums so the optimiser can't discard the loops.
  cout << "Done." << endl << suma << sumb << sumc << sumd;

  delete [] a;
  delete [] b;
  delete [] c;
  delete [] d;

  return 0;
}

Running this, the latter version is usually faster.

I tried writing some assembler to beat the second loop but my attempts were usually slower. So I decided to see what the compiler had generated. Here's the optimised assembler produced for the main summation loop in the second version:

00401110  mov         edx,dword ptr [eax-0Ch] 
00401113  add         edx,dword ptr [eax-8] 
00401116  add         eax,14h 
00401119  add         edx,dword ptr [eax-18h] 
0040111C  add         edx,dword ptr [eax-10h] 
0040111F  add         edx,dword ptr [eax-14h] 
00401122  add         ebx,edx 
00401124  sub         ecx,1 
00401127  jne         main+110h (401110h)

Here's the register usage:

  • eax = used to index the array
  • ebx = the grand total
  • ecx = loop counter
  • edx = sum of the five integers accessed in one iteration of the loop

There are a few interesting things here:

  • The compiler has unrolled the loop five times.
  • The order of memory access is not contiguous.
  • It updates the array index in the middle of the loop.
  • It sums five integers then adds that to the grand total.

To really understand why this is fast, you'd need to use Intel's VTune performance analyser to see where the CPU and memory stalls are, as this code is quite counter-intuitive.

Skizz
Interesting. What is your CPU?
Eric Bainville
It's an old 3GHz Pentium 4 with 1GB of RAM and running WinXP. No details on memory type, motherboard or chipset which would have a bigger impact on performance.
Skizz
The code you've posted might be misleading. Just because you use different arrays doesn't mean the cache won't have fetched all 4 arrays in one burst. If it has, the second loop will be faster because it doesn't have to bring the data into the cache; it's already there.
Jorge Córdoba
That's why "performance unit testing" might be as misleading as everything :)
Jorge Córdoba
I've tried to make two points in my answer: the first is that the only way to tell which is faster is to time actual running code. The second is that modern CPUs are complex and getting the most out of them is not easy and not intuitive, especially without tools to help out.
Skizz
Thinking about what Jorge said: the arrays are nearly 4 MB each, so it's highly unlikely all four will be in the cache at the same time, thus distorting the timings for the second test. Desktop PCs currently have an upper limit of 8 MB cache, but generally about 2 MB, so that'll hold two of the four arrays at best and half an array on average.
Skizz
Skizz, this is a nice example why a good optimizer beats hand written assembly code hands down these days. The people writing these optimizers spend so much time studying the intrinsics of the target hardware to squeeze everything possible out of it that hardly any assembly programmer is able to follow. Upvote for your nice explanation.
karx11erx
I just tried a simple SIMD version and it was 15-20% faster than the compiler's optimised one.
Skizz
+1  A: 

If the arrays are so large that the cache cannot hold both of them at once (which example 1 effectively requires) but can hold a single one (which is all example 2 needs), then the code of the first example will be slower than that of the second. Otherwise, the code of the first example will be faster.

Sadat
+3  A: 

For me (GCC -O3), measurement shows that the second version is faster by some 25%, which can be explained by a more efficient memory access pattern (all memory accesses are close to each other, not all over the place). Of course, you'll need to repeat the operation thousands of times before the difference becomes significant.

I also tried std::accumulate from the <numeric> header, which is the simple way to implement the second version, and it was in turn a tiny bit faster than the second version (probably due to a more compiler-friendly looping mechanism?):

sumA = std::accumulate(a, a + 10000, 0);
sumB = std::accumulate(b, b + 10000, 0);
UncleBens