It's likely down to CPU cache behaviour (at 12 MB, your images far exceed the 256 KB L2 cache in the ARM Cortex-A8 that's inside an iPhone 3GS).
The first example accesses the reading array in sequential order, which is fast, but has to access the writing array out of order, which is slow.
The second example is the opposite - the writing array is written in fast, sequential order and the reading array is accessed in a slower fashion. Write misses are evidently less costly under this workload than read misses.
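To make the two access patterns concrete, here is a minimal sketch of what the two loops might look like. The function names, the `dst`/`src` parameter names, and passing `width` and `height` as arguments are my assumptions for illustration, not taken from your code. Both functions produce the same output; only the order in which memory is touched differs.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical version of the "first example" pattern:
 * src is read sequentially, dst is written with a stride of `width`. */
void reorder_seq_read(uint32_t *restrict dst, const uint32_t *restrict src,
                      size_t width, size_t height)
{
    size_t i = 0;
    for (size_t x = 0; x < width; x++)
        for (size_t y = 0; y < height; y++)
            dst[x + y * width] = src[i++];
}

/* Hypothetical version of the "second example" pattern:
 * dst is written sequentially, src is read with a stride of `height`. */
void reorder_seq_write(uint32_t *restrict dst, const uint32_t *restrict src,
                       size_t width, size_t height)
{
    size_t i = 0;
    for (size_t y = 0; y < height; y++)
        for (size_t x = 0; x < width; x++)
            dst[i++] = src[y + x * height];
}
```

Timing both on the device with your real image sizes is the only reliable way to tell which one wins for a given workload.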
Ulrich Drepper's article What Every Programmer Should Know About Memory is recommended reading if you want to know more about this kind of thing.
Note that if you have this operation wrapped up in a function, you will help the optimiser generate better code if you use the restrict qualifier on your pointer arguments, like this:
/* restrict must qualify the pointer itself, so it goes after the '*' */
void reorder(uint32_t * restrict buffer1, uint32_t * restrict buffer2)
{
    int i = 0;
    for (int x = 0; x < width; x++)
        for (int y = 0; y < height; y++)
            buffer1[x + y * width] = buffer2[i++];
}
(The restrict qualifier promises the compiler that the data pointed to by the two pointers doesn't overlap - which in this case is necessary for the function to make sense anyway.)