Very interesting, I never stopped to consider that can be such a huge difference simply by accessing the array indexes "non-sequentially". I gave it a try myself, and also found the following code had the second loop taking between 2 - 3 times longer :
// Hmmm, how to insert blank lines in the code formatter???
static void Main(string[] args)
{
Stopwatch timer = new Stopwatch();
int arraySize = 10000;
// First array, access X by Y
int[,] xbyy = new int[arraySize, arraySize];
timer.Start();
for (int x = 0; x < arraySize; x++)
for (int y = 0; y < arraySize; y++)
{
xbyy[x, y] = 15;
}
timer.Stop();
TimeSpan duration = timer.Elapsed;
string realTimeFormat = string.Format("{0:00} minutes {1:00} seconds {2:000} milliseconds",
duration.Minutes, duration.Seconds, duration.Milliseconds);
Console.WriteLine("X by Y took " + realTimeFormat);
// Seecond array, access Y by X
int[,] ybyx = new int[arraySize, arraySize];
timer.Start();
for (int x = 0; x < arraySize; x++)
for (int y = 0; y < arraySize; y++)
{
ybyx[y, x] = 15;
}
timer.Stop();
duration = timer.Elapsed;
realTimeFormat = string.Format("{0:00} minutes {1:00} seconds {2:000} milliseconds",
duration.Minutes, duration.Seconds, duration.Milliseconds);
Console.WriteLine("Y by X took " + realTimeFormat);
Console.ReadLine();
}
To keep things short, here are the emitted IL snippets for the X by Y loop and the Y by X loop.
Initial code prior to looping,
.method private hidebysig static void Main(string[] args) cil managed
{
.entrypoint
// Code size 290 (0x122)
.maxstack 4
.locals init ([0] class [System]System.Diagnostics.Stopwatch timer,
[1] int32 arraySize,
[2] int32[0...,0...] xbyy,
[3] int32 x,
[4] int32 y,
[5] valuetype [mscorlib]System.TimeSpan duration,
[6] string realTimeFormat,
[7] int32[0...,0...] ybyx,
[8] int32 V_8,
[9] int32 V_9)
IL_0000: newobj instance void [System]System.Diagnostics.Stopwatch::.ctor()
IL_0005: stloc.0
IL_0006: ldc.i4 0x2710
IL_000b: stloc.1
looping X by Y
IL_000c: ldloc.1
IL_000d: ldloc.1
IL_000e: newobj instance void int32[0...,0...]::.ctor(int32,
int32)
IL_0013: stloc.2
IL_0014: ldloc.0
IL_0015: callvirt instance void [System]System.Diagnostics.Stopwatch::Start()
IL_001a: ldc.i4.0
IL_001b: stloc.3
IL_001c: br.s IL_003d
IL_001e: ldc.i4.0
IL_001f: stloc.s y
IL_0021: br.s IL_0034
IL_0023: ldloc.2
IL_0024: ldloc.3
IL_0025: ldloc.s y
IL_0027: ldc.i4.s 15
IL_0029: call instance void int32[0...,0...]::Set(int32,
int32,
int32)
IL_002e: ldloc.s y
IL_0030: ldc.i4.1
IL_0031: add
IL_0032: stloc.s y
IL_0034: ldloc.s y
IL_0036: ldloc.1
IL_0037: blt.s IL_0023
IL_0039: ldloc.3
IL_003a: ldc.i4.1
IL_003b: add
IL_003c: stloc.3
IL_003d: ldloc.3
IL_003e: ldloc.1
IL_003f: blt.s IL_001e
IL_0041: ldloc.0
looping Y by X
IL_0090: ldloc.1
IL_0091: ldloc.1
IL_0092: newobj instance void int32[0...,0...]::.ctor(int32,
int32)
IL_0097: stloc.s ybyx
IL_0099: ldloc.0
IL_009a: callvirt instance void [System]System.Diagnostics.Stopwatch::Start()
IL_009f: ldc.i4.0
IL_00a0: stloc.s V_8
IL_00a2: br.s IL_00c7
IL_00a4: ldc.i4.0
IL_00a5: stloc.s V_9
IL_00a7: br.s IL_00bc
IL_00a9: ldloc.s ybyx
IL_00ab: ldloc.s V_9
IL_00ad: ldloc.s V_8
IL_00af: ldc.i4.s 15
IL_00b1: call instance void int32[0...,0...]::Set(int32,
int32,
int32)
IL_00b6: ldloc.s V_9
IL_00b8: ldc.i4.1
IL_00b9: add
IL_00ba: stloc.s V_9
IL_00bc: ldloc.s V_9
IL_00be: ldloc.1
IL_00bf: blt.s IL_00a9
IL_00c1: ldloc.s V_8
IL_00c3: ldc.i4.1
IL_00c4: add
IL_00c5: stloc.s V_8
IL_00c7: ldloc.s V_8
IL_00c9: ldloc.1
IL_00ca: blt.s IL_00a4
IL_00cc: ldloc.0
The IL logic flow is somewhat similar. The main difference I can observe is the first loop manages to use stloc and ldloc for positions 2 and 3 for the first array and the X index variable. By the time it came to the second loop, it had expended the maxstack and thus used the stloc.s and ldloc.s instructions. I believe this is the difference between referencing variables on the stack versus on the heap, and contributing to the slower performance.
Now, if you swap the order in which the loops are tested, so that the Y by X loop gets run first to access stack references, you will see the timing durations get reversed.
UPDATE: I was wrong about referencing stack or heap addresses. It just seems that the first four variables in a method are more efficient to reference with the dedicated stloc.0, 1, 3, 4 and ldloc.0, 1, 3, 4 opcodes.
http://weblogs.asp.net/mnolton/archive/2004/01/09/48992.aspx