Some time ago I wrote a little piece of code to ask about on interviews and see how people understand concepts of cache and memory:
#include "stdafx.h"
#include <stdlib.h>
#include <windows.h>
#include <iostream>
#define TOTAL 0x20000000
using namespace std;
__int64 count(int INNER, int OUTER)
{
int a = 0;
int* arr = (int*) HeapAlloc(GetProcessHeap(), 0, INNER * sizeof(int));
if (!arr) {
cerr << "HeapAlloc failed\n";
return 1;
}
LARGE_INTEGER freq;
LARGE_INTEGER startTime, endTime;
__int64 elapsedTime, elapsedMilliseconds;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&startTime);
/* Начало работы */
for (int i = 0; i < OUTER; i++) {
for (int j = 0; j < INNER; j++) {
a |= i;
arr[j] = a;
}
}
/* Конец работы */
QueryPerformanceCounter(&endTime);
elapsedTime = endTime.QuadPart - startTime.QuadPart;
elapsedMilliseconds = (1000 * elapsedTime) / freq.QuadPart;
HeapFree(GetProcessHeap(), 0, arr);
return elapsedMilliseconds;
}
int _tmain(int argc, _TCHAR* argv[])
{
__int64 time;
for (int INNER = 0x10; INNER <= 0x2000000; INNER <<= 1) {
int OUTER = TOTAL / INNER;
time = count(INNER, OUTER);
cout << INNER << "\t" << OUTER << "\t" << time << "\n";
}
}
That's what it compiles to (the loop itself):
00401062 xor ecx,ecx
00401064 test ebp,ebp
00401066 jle count+83h (401083h)
00401068 xor eax,eax
0040106A test ebx,ebx
0040106C jle count+7Ch (40107Ch)
0040106E mov edi,edi
00401070 or esi,ecx
00401072 mov dword ptr [edi+eax*4],esi
00401075 add eax,1
00401078 cmp eax,ebx
0040107A jl count+70h (401070h)
0040107C add ecx,1
0040107F cmp ecx,ebp
00401081 jl count+68h (401068h)
That's what the program outputs on my machine:
LOG2(INNER) LOG2(OUTER) Time, ms
4 25 523
5 24 569
6 23 441
7 22 400
8 21 367
9 20 358
10 19 349
11 18 364
12 17 378
13 16 384
14 15 357
15 14 377
16 13 379
17 12 390
18 11 386
19 10 419
20 9 995
21 8 1,015
22 7 1,038
23 6 1,071
24 5 1,082
25 4 1,119
I ask people to explain what's going on. As inner array grows, the number of cycles decreases, as the time does. As inner array outgrows the cache, cache misses begin to happen and time raises. It's all right so far.
BUT: when the INNER array size is 16 (that gives us 64 bytes of data), there is little performance boost, despite the greater number of jmps
in code. It's little (523 vs. 569), but reproducible.
The question is: why this boost?