views:

263

answers:

6
#define IMGX 8192
#define IMGY 8192
int red_freq[256];
char img[IMGY][IMGX][3];

main(){ 

int i, j;
  long long total;
  long long redness;

  for (i = 0; i < 256; i++) 
    red_freq[i] = 0;

  for (i = 0; i < IMGY; i++) 
    for (j = 0; j < IMGX; j++) 
      red_freq[img[i][j][0]] += 1;

  total = 0;
  for (i = 0; i < 256; i++) 
    total += (long long)i * (long long)red_freq[i];

  redness = (total + (IMGX*IMGY/2))/(IMGX*IMGY);

what's the difference when you replace the second for loop into

for (j = 0; j < IMGX; j++) 
    for (i = 0; i < IMGY; i++) 
      red_freq[img[i][j][0]] += 1;

everything else are stay the same and why the first algorithm is faster than then second algorithm ?

Does it have something to do with the memory allocation?

+5  A: 

It is because the first version is iterating through the memory in the order that it is physically laid out, while the second one is jumping around in memory from one column in the array to the next. This will cause cache thrashing and interfere with the optimal performance of the CPU, which then has to spend lots of time waiting for the cache to be refreshed over and over again.

1800 INFORMATION
+8  A: 

The first version alters memory in sequence, so uses the processor cache optimally. The second version uses one value from each cache line it loads, so it pessimal for cache use.

The point to understand is that the cache is divided into lines, each of which will contain many values in the overall structure.

The first version might also be optimized by the compiler to use more clever instructions (SIMD instructions) which would be even faster.

Douglas Leeder
A: 

Yes it has something to do with memory allocation. The first loop indexes the inner dimension of img, which happens to span over only 3 bytes each time. That's within one memory page easily (i believe a common size here is 4kB for one page). But with your second version, the outer dimension's index changes fast. That will cause memory reads spread over a much larger range of memory - namely sizeof (char[IMGX][3]) bytes, which is 24kB. And with each change of the inner index, those jumps start to happen again. That will hit different pages and is probably somewhat slower. Also i heard the CPU reads ahead memory. That will make the first version benefit, because at the time it reads, that data is probably already in the cache. I can imagine the second version doesn't benefit from that, because it makes those large jumps around the memory back and forth.

I would suspect the difference is not that much, but if the algorithm runs many times, it eventually becomes noticeable. You probably want to read the article Row-major Order on wikipedia. That is the scheme used to store multi-dimensional arrays in C.

Johannes Schaub - litb
looks very suspicious. questioner accepts my answer. then i get one down-vote (tactical?), and another answer is accepted. i still can't find something wrong in this answer, so i must assume that down-vote was tactical...
Johannes Schaub - litb
+1  A: 

Due to how the memory is laid out the first version maintains data locality and therefore causes less cache misses.

Motti
+2  A: 

It's because big modern processor architectures (like the one in a PC) are massively optimised to work on memory which is 'near' (in address-related terms) memory which they've recently accessed. Actual physical memory access is much, much slower than the CPU can theoretically run, so everything which helps the process do its access in the most efficient fashion helps with performance.

It's pretty much impossibly to generalise more than that, but 'locality of reference' is a good thing to aim for.

Will Dean
+1  A: 

memory allocation happens only once and it is at the beginning so it can not be the reason. the reason is how the runtime calculates the address. In both cases memory address is calculated as

(i * (IMGY * IMGX)) + (j * IMGX) + 0

In the first algorithm

(i * (IMGY * IMGX)) gets calculates 8192 times
(j * IMGX) gets calculated 8192 * 8192 times

In the second algorithm

(i * (IMGY * IMGX)) gets calculates 8192 * 8192 times
(j * IMGX) gets calculated 8192 times

Since

(i * (IMGY * IMGX))

involves two multiplications, doing it more takes more time. that is the reason

Ender