You might be able to improve it by copying in cache-aligned blocks rather than by rows, as at the moment the stride of either src dest will be a miss ( depending whether delphi is row major or column major ).
Yes, there are faster ways to do this.
Your simple loop spends most of the time in cache misses. This happends because you touch a lot of data at very different places in a tight loop. Even worse: Your memory locations are exactly a power of two apart. That's a size where the cache performs worst.
You can improve this rotation algorithm if you improve the locality of your memory accesses.
A simple way to do this would be to rotate each 8x8 pixel block on it's own using the same code you've used for your whole bitmap, and wrap another loop that splits the image rotation into chunks of 8x8 pixels each.
E.g. something like this (not checked, and sorry for the C-code. My Delphi skills aren't up to date):
// this is the outer-loop that breaks your image rotation
// into chunks of 8x8 pixels each:
for (int block_x = 0; block_x < 2048; block_x+=8)
{
for (int block_y = 0; blocky_y < 2048; block_y+=8)
{
// this is the inner-loop that processes a block
// of 8x8 pixels.
for (int x= 0; x<8; x++)
for (int y=0; y<8; y++)
dest[x+block_x][y+block_y] = src[y+block_y][x+block_x]
}
}
There are other ways as well. You could process the data in Hilbert-Order or Morton-Order. That would be in theory even a bit faster, but the code will be much more complex.
Btw - Since you've mentioned that SSE is an option for you. Note that you can rotate a 8x8 byte block within the SSE-registers. It's a bit tricky to get it working, but looking at SSE matrix transpose code should get you started as it's the same thing.
EDIT:
Just checked:
With a block-size of 8x8 pixels the code runs ca. 5 times faster on my machine. With a block-size of 16x16 it runs 10 times faster.
Seems like it's a good idea to experiment with different block-sizes.
Here is the (very simple) test-program I've used:
#include <stdio.h>
#include <windows.h>
char temp1[2048*2048];
char temp2[2048*2048];
void rotate1 (void)
{
int x,y;
for (y=0; y<2048; y++)
for (x=0; x<2048; x++)
temp2[2048*y+x] = temp1[2048*x+y];
}
void rotate2 (void)
{
int x,y;
int bx, by;
for (by=0; by<2048; by+=8)
for (bx=0; bx<2048; bx+=8)
for (y=0; y<8; y++)
for (x=0; x<8; x++)
temp2[2048*(y+by)+x+bx] = temp1[2048*(x+bx)+y+by];
}
void rotate3 (void)
{
int x,y;
int bx, by;
for (by=0; by<2048; by+=16)
for (bx=0; bx<2048; bx+=16)
for (y=0; y<16; y++)
for (x=0; x<16; x++)
temp2[2048*(y+by)+x+bx] = temp1[2048*(x+bx)+y+by];
}
int main (int argc, char **args)
{
int i, t1;
t1 = GetTickCount();
for (i=0; i<20; i++) rotate1();
printf ("%d\n", GetTickCount()-t1);
t1 = GetTickCount();
for (i=0; i<20; i++) rotate2();
printf ("%d\n", GetTickCount()-t1);
t1 = GetTickCount();
for (i=0; i<20; i++) rotate3();
printf ("%d\n", GetTickCount()-t1);
}
If the image isn't square, you can't do in-place. Even if you work in square images, the transform isn't conducive to in-place work.
If you want to try to do things a little faster, you can try to take advantage of the row strides to make it work, but I think the best you would do is to read 4 bytes at a time in a long from the source and then write it into four consecutive rows in the dest. That should cut some of your overhead, but I wouldn't expect more than a 5% improvement.
If you can use C++ then you may want to look at Eigen.
It is a C++ template library that uses SSE (2 and later) and AltiVec instruction sets with graceful fallback to non-vectorized code.
Fast. (See benchmark).
Expression templates allow to intelligently remove temporaries and enable lazy evaluation, when that is appropriate -- Eigen takes care of this automatically and handles aliasing too in most cases.
Explicit vectorization is performed for the SSE (2 and later) and AltiVec instruction sets, with graceful fallback to non-vectorized code. Expression templates allow to perform these optimizations globally for whole expressions.
With fixed-size objects, dynamic memory allocation is avoided, and the loops are unrolled when that makes sense.
For large matrices, special attention is paid to cache-friendliness.