views:

421

answers:

3

Some time ago I wrote a little piece of code to ask about on interviews and see how people understand concepts of cache and memory:

#include "stdafx.h"
#include <stdlib.h>
#include <windows.h>
#include <iostream>

#define TOTAL 0x20000000

using namespace std;

__int64 count(int INNER, int OUTER)
{
    int a = 0;
    int* arr = (int*) HeapAlloc(GetProcessHeap(), 0, INNER * sizeof(int));
    if (!arr) {
     cerr << "HeapAlloc failed\n";
     return 1;
    }
    LARGE_INTEGER freq;
    LARGE_INTEGER startTime, endTime;
    __int64 elapsedTime, elapsedMilliseconds;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&startTime);

    /* Начало работы */

    for (int i = 0; i < OUTER; i++) {
     for (int j = 0; j < INNER; j++) {
      a |= i;
      arr[j] = a;
     }
    }

    /* Конец работы */

    QueryPerformanceCounter(&endTime);
    elapsedTime = endTime.QuadPart - startTime.QuadPart;
    elapsedMilliseconds = (1000 * elapsedTime) / freq.QuadPart;
    HeapFree(GetProcessHeap(), 0, arr);
    return elapsedMilliseconds;
}

int _tmain(int argc, _TCHAR* argv[])
{
    __int64 time;
    for (int INNER = 0x10; INNER <= 0x2000000; INNER <<= 1) {
     int OUTER = TOTAL / INNER;
     time = count(INNER, OUTER);
     cout << INNER << "\t" << OUTER << "\t" << time << "\n";
    }
}

That's what it compiles to (the loop itself):

00401062  xor         ecx,ecx 
00401064  test        ebp,ebp 
00401066  jle         count+83h (401083h) 
00401068  xor         eax,eax 
0040106A  test        ebx,ebx 
0040106C  jle         count+7Ch (40107Ch) 
0040106E  mov         edi,edi 
00401070  or          esi,ecx 
00401072  mov         dword ptr [edi+eax*4],esi 
00401075  add         eax,1 
00401078  cmp         eax,ebx 
0040107A  jl          count+70h (401070h) 
0040107C  add         ecx,1 
0040107F  cmp         ecx,ebp 
00401081  jl          count+68h (401068h)

That's what the program outputs on my machine:


LOG2(INNER) LOG2(OUTER)  Time, ms
4           25           523
5           24           569
6           23           441
7           22           400
8           21           367
9           20           358
10          19           349
11          18           364
12          17           378
13          16           384
14          15           357
15          14           377
16          13           379
17          12           390
18          11           386
19          10           419
20          9            995
21          8            1,015
22          7            1,038
23          6            1,071
24          5            1,082
25          4            1,119


I ask people to explain what's going on. As inner array grows, the number of cycles decreases, as the time does. As inner array outgrows the cache, cache misses begin to happen and time raises. It's all right so far.

BUT: when the INNER array size is 16 (that gives us 64 bytes of data), there is little performance boost, despite the greater number of jmps in code. It's little (523 vs. 569), but reproducible.

The question is: why this boost?

+7  A: 

Probably because 64 is the cache line size on your machine, and you basically run each iteration fully out of a single cache line.

Anteru
+3  A: 

I'm not sure what you're asking. Are you just checking what we know? Why bother? Or are you asking us because you can't explain the boost at 64bytes yourself? Or is it to find out if this is a good interview question, or...?

Anyway, I can't say I like your code. If the intention is simply to provide a background for your interview question, you should remove all unnecessary code. Is HeapAlloc important? Why couldn't the array be declared on the stack? Or with new/malloc?

Why do you need to do error handling in this little test program? Again, it just distracts and adds more noise. The same goes for the QPC calls. And I won't even ask why you need a precompiled header in this.

In order to ask the interviewee about a 6-line loop, he has to sift through 16 lines of irrelevant noise. Why?

And as mentioned in a comment, the output table is basically unreadable.

I'm all for seeing if an interviewee can read code, but I don't see the point making a question about performance and cache characteristics so hard to read.

In any case, at 64-byte, one INNER array fits exactly into most CPU's cache lines. Which means that exactly one cache line has to be read for each iteration of OUTER.

jalf
I posted whole code here to let you just copy and compile it -- in case you wanted. On an actual interview, for sure, I show only the loop. Sorry if I hurt your feelings :)
Quassnoi
A: 

Just to clarify, some of the spaces in his table should actually be ,'s. you all can figure it out.

PiNoYBoY82