tags:

views:

289

answers:

4

Basically you have two ways for doing this:

for (int x = 0; x < UPPER_X; x++)
    for (int y = 0; y < UPPER_Y; y++)
    {        
        arr1[x, y] = get_value();
        arr2[y, x] = get_value();
    }

The only difference is what variable to change in the inner loop: first or second. I heard that the results differ from language to language.

What is right order for .NET?

+4  A: 

You should benchmark your particular circumstances to be sure.

You would think that there would be no difference for rectangular arrays (i.e. contiguously allocated memory), BUT according to this MSDN article there is a difference:

You can get even better results by converting a multidimensional array into a single-dimensional array. If you don't mind the syntax, this can be trivial; just use one index as an offset. For example, the following declares a single-dimensional array to be used as a two-dimensional array:

double[] myArray = new double[ROW_DIM * COLUMN_DIM];

For indexing the elements of this array, use the following offset:

myArray[row * COLUMN_DIM + column];

This will undoubtedly be faster than an equivalent jagged or rectangular array.

Mitch Wheat
I think this should be well within the reach of the compiler's optimisation, but I do recall doing the same thing in C in the dim and distant past.
Mitch Wheat
This only works if you vary column faster than row. If you do it the other way around, you lose the locality of reference and could suffer cache-miss penalties.
tvanfosson
Thanks for the MSDN link.
Saif Khan
+1  A: 

So I did the benchmark and the results are that access to arr1 is three times faster.

catbert
Can you edit your answer or question with the benchmark specifics? I'd have to see some very, very compelling evidence to give this performance tweak a second thought. Do you have a measurable performance problem related to this?
Michael Haren
I think it is useful to post your resultant IL code that ran the timing tests. Look at my own assessment code in my answer post.
icelava
+3  A: 

The reason one is faster than the other has to do with the processor's cache, and the way the data is laid out in memory.

There are two normal ways to store the two-dimensional data is in one-dimensional address space, Either you can store all of the data for the first row, then the second row, and so on (aka row major order), or you can do it by columns (aka column major order). Below is what the memory locations would be for a 3x3 array for both of these options.

Rows:

1 2 3
4 5 6
7 8 9

Columns:

1 4 7
2 5 8
3 6 9

When you access one memory location, an entire cache line (which can be between 8 and 512 bytes, according to Wikipedia) is loaded into the cache. So if you access the next memory location it will already be in the cache. Thus, it can be much faster to access memory sequentially than to jump around in the address space. Thus, with large two-dimensional arrays, there can be a significant speed difference between choosing rows or columns as your inner loop variable.

Daniel Plaisted
+1 for locality of reference
tvanfosson
+1  A: 

Very interesting, I never stopped to consider that can be such a huge difference simply by accessing the array indexes "non-sequentially". I gave it a try myself, and also found the following code had the second loop taking between 2 - 3 times longer :

// Hmmm, how to insert blank lines in the code formatter???
static void Main(string[] args)
{
 Stopwatch timer = new Stopwatch();
 int arraySize = 10000;

 // First array, access X by Y
 int[,] xbyy = new int[arraySize, arraySize];


 timer.Start();
 for (int x = 0; x < arraySize; x++)
  for (int y = 0; y < arraySize; y++)
  {
   xbyy[x, y] = 15;
  }
 timer.Stop();
 TimeSpan duration = timer.Elapsed;
 string realTimeFormat = string.Format("{0:00} minutes {1:00} seconds {2:000} milliseconds",
     duration.Minutes, duration.Seconds, duration.Milliseconds);
 Console.WriteLine("X by Y took " + realTimeFormat);



 // Seecond array, access Y by X
 int[,] ybyx = new int[arraySize, arraySize];

 timer.Start();
 for (int x = 0; x < arraySize; x++)
  for (int y = 0; y < arraySize; y++)
  {
   ybyx[y, x] = 15;
  }
 timer.Stop();
 duration = timer.Elapsed;
 realTimeFormat = string.Format("{0:00} minutes {1:00} seconds {2:000} milliseconds",
    duration.Minutes, duration.Seconds, duration.Milliseconds);
 Console.WriteLine("Y by X took " + realTimeFormat);


 Console.ReadLine();
}

To keep things short, here are the emitted IL snippets for the X by Y loop and the Y by X loop.

Initial code prior to looping,

.method private hidebysig static void  Main(string[] args) cil managed
{
  .entrypoint
  // Code size       290 (0x122)
  .maxstack  4
  .locals init ([0] class [System]System.Diagnostics.Stopwatch timer,
           [1] int32 arraySize,
           [2] int32[0...,0...] xbyy,
           [3] int32 x,
           [4] int32 y,
           [5] valuetype [mscorlib]System.TimeSpan duration,
           [6] string realTimeFormat,
           [7] int32[0...,0...] ybyx,
           [8] int32 V_8,
           [9] int32 V_9)
  IL_0000:  newobj     instance void [System]System.Diagnostics.Stopwatch::.ctor()
  IL_0005:  stloc.0
  IL_0006:  ldc.i4     0x2710
  IL_000b:  stloc.1

looping X by Y

  IL_000c:  ldloc.1
  IL_000d:  ldloc.1
  IL_000e:  newobj     instance void int32[0...,0...]::.ctor(int32,
                                                             int32)
  IL_0013:  stloc.2
  IL_0014:  ldloc.0
  IL_0015:  callvirt   instance void [System]System.Diagnostics.Stopwatch::Start()
  IL_001a:  ldc.i4.0
  IL_001b:  stloc.3
  IL_001c:  br.s       IL_003d
  IL_001e:  ldc.i4.0
  IL_001f:  stloc.s    y
  IL_0021:  br.s       IL_0034
  IL_0023:  ldloc.2
  IL_0024:  ldloc.3
  IL_0025:  ldloc.s    y
  IL_0027:  ldc.i4.s   15
  IL_0029:  call       instance void int32[0...,0...]::Set(int32,
                                                           int32,
                                                           int32)
  IL_002e:  ldloc.s    y
  IL_0030:  ldc.i4.1
  IL_0031:  add
  IL_0032:  stloc.s    y
  IL_0034:  ldloc.s    y
  IL_0036:  ldloc.1
  IL_0037:  blt.s      IL_0023
  IL_0039:  ldloc.3
  IL_003a:  ldc.i4.1
  IL_003b:  add
  IL_003c:  stloc.3
  IL_003d:  ldloc.3
  IL_003e:  ldloc.1
  IL_003f:  blt.s      IL_001e
  IL_0041:  ldloc.0

looping Y by X

  IL_0090:  ldloc.1
  IL_0091:  ldloc.1
  IL_0092:  newobj     instance void int32[0...,0...]::.ctor(int32,
                                                             int32)
  IL_0097:  stloc.s    ybyx
  IL_0099:  ldloc.0
  IL_009a:  callvirt   instance void [System]System.Diagnostics.Stopwatch::Start()
  IL_009f:  ldc.i4.0
  IL_00a0:  stloc.s    V_8
  IL_00a2:  br.s       IL_00c7
  IL_00a4:  ldc.i4.0
  IL_00a5:  stloc.s    V_9
  IL_00a7:  br.s       IL_00bc
  IL_00a9:  ldloc.s    ybyx
  IL_00ab:  ldloc.s    V_9
  IL_00ad:  ldloc.s    V_8
  IL_00af:  ldc.i4.s   15
  IL_00b1:  call       instance void int32[0...,0...]::Set(int32,
                                                           int32,
                                                           int32)
  IL_00b6:  ldloc.s    V_9
  IL_00b8:  ldc.i4.1
  IL_00b9:  add
  IL_00ba:  stloc.s    V_9
  IL_00bc:  ldloc.s    V_9
  IL_00be:  ldloc.1
  IL_00bf:  blt.s      IL_00a9
  IL_00c1:  ldloc.s    V_8
  IL_00c3:  ldc.i4.1
  IL_00c4:  add
  IL_00c5:  stloc.s    V_8
  IL_00c7:  ldloc.s    V_8
  IL_00c9:  ldloc.1
  IL_00ca:  blt.s      IL_00a4
  IL_00cc:  ldloc.0

The IL logic flow is somewhat similar. The main difference I can observe is the first loop manages to use stloc and ldloc for positions 2 and 3 for the first array and the X index variable. By the time it came to the second loop, it had expended the maxstack and thus used the stloc.s and ldloc.s instructions. I believe this is the difference between referencing variables on the stack versus on the heap, and contributing to the slower performance.

Now, if you swap the order in which the loops are tested, so that the Y by X loop gets run first to access stack references, you will see the timing durations get reversed.


UPDATE: I was wrong about referencing stack or heap addresses. It just seems that the first four variables in a method are more efficient to reference with the dedicated stloc.0, 1, 3, 4 and ldloc.0, 1, 3, 4 opcodes.

http://weblogs.asp.net/mnolton/archive/2004/01/09/48992.aspx

icelava
The difference in speed is not really due to the difference in IL, it has to do with the processor cache and the pattern of memory accesses. See my answer for more detail.
Daniel Plaisted
I agree having data prefetched into CPU cache can speed things up. But, what then, is your opinion that my testing of swapping the test sequence around resulted in the Y by X loop running faster than the X by Y loop?
icelava
The X by Y loop accesses data sequentially. The Y by X loop jumps around in memory, accessing these addresses: 1, 10001, 20001, 30001, ... 2, 10002, 20002, 30002, etc
Daniel Plaisted
Erm, still have not answered my question. Why would the Y by X loop run _faster_ than the X by Y loop by me having simply swap the code sequence to run the Y by X loop first?
icelava
He did answer it, accessing the memory in that pattern causes a lot of page faults which are ~10000 times slower than a page hit. Order matters! It's sad really how much work you put into this while still drawing the wrong conclusion. Shame to give you a -1.
Blindy
Looking back at this issue after so long, i *think* i know what is happening. Because the first array, whichever way it was loaded, sets the stage for how the processor is going to cache and *arrange* the data? Therefore, the second array attempt using the opposite dimensional access would not find the next array item in cache, be it column or row.
icelava