views: 1432

answers: 5
A: 

You might be able to improve it by copying in cache-aligned blocks rather than by rows, since at the moment walking either src or dest with the large stride will miss the cache on every access (depending on whether Delphi is row major or column major).

Pete Kirkham
(Isn't the problem that one of the two sides of the = is always accessed with a large stride? Either I walk src linearly or dst, but never both at once.)
Marco van de Voort
No, you can copy a block and transpose it within the cache, as Nils demonstrates (assuming the machine he's on has 256-byte cache lines).
Pete Kirkham
OK, sorry, then I got it wrong; yes, Nils' answer is what I was looking for.
Marco van de Voort
A: 

http://stackoverflow.com/questions/42519/how-do-you-rotate-a-two-dimensional-array

Eyal
As far as I can quickly see, none of the answers there address cache effects, which are the trouble I was asking about and what the accepted answer here addresses.
Marco van de Voort
+13  A: 

Yes, there are faster ways to do this.

Your simple loop spends most of its time in cache misses. This happens because you touch a lot of data at very different places in a tight loop. Even worse: your memory locations are exactly a power of two apart, so many of them map to the same cache sets, which is where the cache performs worst.

You can speed up this rotation considerably if you improve the locality of your memory accesses.

A simple way to do this is to rotate each 8x8 pixel block on its own, using the same code you've used for your whole bitmap, and to wrap it in another loop that splits the image rotation into chunks of 8x8 pixels each.

E.g. something like this (not checked, and sorry for the C code; my Delphi skills aren't up to date):

 // this is the outer-loop that breaks your image rotation
 // into chunks of 8x8 pixels each:
 for (int block_x = 0; block_x < 2048; block_x+=8)
 {
    for (int block_y = 0; block_y < 2048; block_y+=8)
    { 
       // this is the inner-loop that processes a block
       // of 8x8 pixels.
       for (int x= 0; x<8; x++)
         for (int y=0; y<8; y++)
            dest[x+block_x][y+block_y] = src[y+block_y][x+block_x];
    }
 }

There are other ways as well. You could process the data in Hilbert order or Morton order; in theory that would be even a bit faster, but the code becomes much more complex.
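For illustration, here is a rough C sketch of the Morton-order variant; it is not code from this answer, and the bit de-interleaving helper, the function name and the hard-coded 2048 size are assumptions. Walking the pixels along the Z-order curve keeps consecutive iterations inside small squares, so both the source and the destination accesses stay cache-friendly:

#include <stdint.h>

// pack every other bit of v (bits 0,2,4,...) into the low bits
static uint32_t compact1by1 (uint32_t v)
{
  v &= 0x55555555;
  v = (v | (v >> 1)) & 0x33333333;
  v = (v | (v >> 2)) & 0x0f0f0f0f;
  v = (v | (v >> 4)) & 0x00ff00ff;
  v = (v | (v >> 8)) & 0x0000ffff;
  return v;
}

// same transpose as the tiled code, but visiting pixels in Morton order
void rotate_morton (const uint8_t *src, uint8_t *dst)
{
  uint32_t i, x, y;
  for (i = 0; i < 2048u * 2048u; i++)
  {
    x = compact1by1 (i);        // even bits of the Morton index -> x
    y = compact1by1 (i >> 1);   // odd bits of the Morton index  -> y
    dst[2048 * y + x] = src[2048 * x + y];
  }
}

In practice the per-pixel bit twiddling can eat much of the locality gain, which is the extra complexity mentioned above; the tiled loops get most of the benefit with far simpler code.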

Btw, since you've mentioned that SSE is an option for you: note that you can rotate an 8x8 byte block entirely within the SSE registers. It's a bit tricky to get working, but looking at SSE matrix-transpose code should get you started, as it's the same thing.
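To make that concrete, here is a hedged SSE2 sketch of an 8x8 byte transpose done entirely in registers with unpack instructions; the function name and the stride parameter are illustrative and not taken from any existing code here:

#include <emmintrin.h>   // SSE2 intrinsics
#include <stdint.h>

void transpose8x8_sse2 (const uint8_t *src, uint8_t *dst, int stride)
{
  // load the eight 8-byte source rows into the low halves of xmm registers
  __m128i r0 = _mm_loadl_epi64 ((const __m128i *)(src + 0 * stride));
  __m128i r1 = _mm_loadl_epi64 ((const __m128i *)(src + 1 * stride));
  __m128i r2 = _mm_loadl_epi64 ((const __m128i *)(src + 2 * stride));
  __m128i r3 = _mm_loadl_epi64 ((const __m128i *)(src + 3 * stride));
  __m128i r4 = _mm_loadl_epi64 ((const __m128i *)(src + 4 * stride));
  __m128i r5 = _mm_loadl_epi64 ((const __m128i *)(src + 5 * stride));
  __m128i r6 = _mm_loadl_epi64 ((const __m128i *)(src + 6 * stride));
  __m128i r7 = _mm_loadl_epi64 ((const __m128i *)(src + 7 * stride));

  // three rounds of interleaving (bytes, words, dwords); afterwards each
  // 64-bit register half holds one transposed column
  __m128i t0 = _mm_unpacklo_epi8 (r0, r1);
  __m128i t1 = _mm_unpacklo_epi8 (r2, r3);
  __m128i t2 = _mm_unpacklo_epi8 (r4, r5);
  __m128i t3 = _mm_unpacklo_epi8 (r6, r7);

  __m128i u0 = _mm_unpacklo_epi16 (t0, t1);
  __m128i u1 = _mm_unpackhi_epi16 (t0, t1);
  __m128i u2 = _mm_unpacklo_epi16 (t2, t3);
  __m128i u3 = _mm_unpackhi_epi16 (t2, t3);

  __m128i v0 = _mm_unpacklo_epi32 (u0, u2);   // columns 0 and 1
  __m128i v1 = _mm_unpackhi_epi32 (u0, u2);   // columns 2 and 3
  __m128i v2 = _mm_unpacklo_epi32 (u1, u3);   // columns 4 and 5
  __m128i v3 = _mm_unpackhi_epi32 (u1, u3);   // columns 6 and 7

  // store each 64-bit half as one destination row
  _mm_storel_epi64 ((__m128i *)(dst + 0 * stride), v0);
  _mm_storel_epi64 ((__m128i *)(dst + 1 * stride), _mm_srli_si128 (v0, 8));
  _mm_storel_epi64 ((__m128i *)(dst + 2 * stride), v1);
  _mm_storel_epi64 ((__m128i *)(dst + 3 * stride), _mm_srli_si128 (v1, 8));
  _mm_storel_epi64 ((__m128i *)(dst + 4 * stride), v2);
  _mm_storel_epi64 ((__m128i *)(dst + 5 * stride), _mm_srli_si128 (v2, 8));
  _mm_storel_epi64 ((__m128i *)(dst + 6 * stride), v3);
  _mm_storel_epi64 ((__m128i *)(dst + 7 * stride), _mm_srli_si128 (v3, 8));
}

An outer pair of tiling loops like the ones above would call this once per 8x8 block; the same three rounds of unpacking appear in most SSE matrix-transpose examples.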


EDIT:

Just checked:

With a block size of 8x8 pixels the code runs ca. 5 times faster on my machine. With a block size of 16x16 it runs 10 times faster.

It seems like a good idea to experiment with different block sizes.

Here is the (very simple) test-program I've used:

#include <stdio.h>
#include <windows.h>

char temp1[2048*2048];
char temp2[2048*2048];

// naive version: walks temp2 row by row but temp1 column by column
void rotate1 (void)
{
  int x,y;
  for (y=0; y<2048; y++)
  for (x=0; x<2048; x++)
    temp2[2048*y+x] = temp1[2048*x+y];
}

// tiled version: same work, processed in 8x8 blocks
void rotate2 (void)
{
  int x,y;
  int bx, by;

  for (by=0; by<2048; by+=8)
  for (bx=0; bx<2048; bx+=8)
  for (y=0; y<8; y++)
  for (x=0; x<8; x++)
    temp2[2048*(y+by)+x+bx] = temp1[2048*(x+bx)+y+by];
}

// tiled version: 16x16 blocks
void rotate3 (void)
{
  int x,y;
  int bx, by;

  for (by=0; by<2048; by+=16)
  for (bx=0; bx<2048; bx+=16)
  for (y=0; y<16; y++)
  for (x=0; x<16; x++)
    temp2[2048*(y+by)+x+bx] = temp1[2048*(x+bx)+y+by];
}


int main (int argc, char **args)
{
  int i, t1;

  t1 = GetTickCount();
  for (i=0; i<20; i++) rotate1();
  printf ("%d\n", GetTickCount()-t1);

  t1 = GetTickCount();
  for (i=0; i<20; i++) rotate2();
  printf ("%d\n", GetTickCount()-t1);

  t1 = GetTickCount();
  for (i=0; i<20; i++) rotate3();
  printf ("%d\n", GetTickCount()-t1);

  return 0;
}
Nils Pipenbrinck
OK, thank you. That sounds very promising. A 10-times speedup would be enough for now, so I won't bother messing with SSE, at least not yet. I had a rough feeling that doing it in small blocks would help, and this both confirms it and provides me with an implementation.
Marco van de Voort
The technique is called loop tiling, btw. Oh, and don't forget that you can parallelize the work: just start two threads and let each one handle half of the image.
Nils Pipenbrinck
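For reference, a minimal sketch of that two-thread split, reusing the temp1/temp2 buffers and the 8x8 tiling from rotate2() above; dividing the work by block rows and the Win32 thread calls are assumptions about how one might wire it up:

// hypothetical two-thread version of rotate2(): each thread transposes one
// horizontal half of the source image (windows.h is already included above)
typedef struct { int by_start, by_end; } RotateJob;

static DWORD WINAPI rotate_half (LPVOID param)
{
  RotateJob *job = (RotateJob *) param;
  int x, y, bx, by;

  for (by = job->by_start; by < job->by_end; by += 8)
  for (bx = 0; bx < 2048; bx += 8)
  for (y = 0; y < 8; y++)
  for (x = 0; x < 8; x++)
    temp2[2048*(y+by)+x+bx] = temp1[2048*(x+bx)+y+by];
  return 0;
}

void rotate_parallel (void)
{
  RotateJob top = { 0, 1024 }, bottom = { 1024, 2048 };
  HANDLE threads[2];

  threads[0] = CreateThread (NULL, 0, rotate_half, &top, 0, NULL);
  threads[1] = CreateThread (NULL, 0, rotate_half, &bottom, 0, NULL);
  WaitForMultipleObjects (2, threads, TRUE, INFINITE);
  CloseHandle (threads[0]);
  CloseHandle (threads[1]);
}

The two halves read and write disjoint regions, so no locking is needed.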
Naah, already have a core dedicated to each camera ;-)
Marco van de Voort
Btw Marco, once you've implemented it, let us know how much of a speedup you got. I'm just curious how it performs in a real-world application.
Nils Pipenbrinck
Done, I updated the question
Marco van de Voort
A: 

If the image isn't square, you can't do it in place. Even if you work with square images, the transform isn't conducive to in-place work.

If you want to make things a little faster, you can try to take advantage of the row strides, but I think the best you could do is read 4 bytes at a time into a long from the source and then write them into four consecutive rows in the dest. That should cut some of your overhead, but I wouldn't expect more than a 5% improvement.
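A rough sketch of that idea, assuming 8-bit pixels in flat 2048x2048 buffers and a little-endian machine (the function name and buffer layout are illustrative, not the poster's code):

#include <stdint.h>
#include <string.h>

// read four source bytes with one 32-bit load, then scatter them into the
// same column of four consecutive destination rows
void rotate_long_reads (const uint8_t *src, uint8_t *dst)
{
  int x, y;
  for (y = 0; y < 2048; y++)
  {
    for (x = 0; x < 2048; x += 4)
    {
      uint32_t v;
      memcpy (&v, &src[2048 * y + x], sizeof v);   // src row y, columns x..x+3

      // little-endian: the lowest byte of v is src[y][x]
      dst[2048 * (x + 0) + y] = (uint8_t) v;
      dst[2048 * (x + 1) + y] = (uint8_t)(v >> 8);
      dst[2048 * (x + 2) + y] = (uint8_t)(v >> 16);
      dst[2048 * (x + 3) + y] = (uint8_t)(v >> 24);
    }
  }
}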

plinth
Correct, but I have several different use cases, and it could be done in place, e.g. by slightly enlarging the image so that it is square. That is one of the more minor use cases, though. As you can see below, the speedup is quite dramatic (300-400%). But in-place would cost speed, since you can't simply exchange blocks but would have to rotate them through 4 x 90 degrees, so I decided not to push for that and to accept the copy for now.
Marco van de Voort
+1  A: 

If you can use C++ then you may want to look at Eigen.

It is a C++ template library with explicit vectorization for the SSE (2 and later) and AltiVec instruction sets, and graceful fallback to non-vectorized code.

- Fast (see the benchmark).
- Expression templates allow temporaries to be removed intelligently and enable lazy evaluation when appropriate; Eigen takes care of this automatically and handles aliasing too in most cases. These optimizations are applied globally, across whole expressions.
- With fixed-size objects, dynamic memory allocation is avoided, and the loops are unrolled when that makes sense.
- For large matrices, special attention is paid to cache-friendliness.

lothar
The problem with such libraries is that converting from the format you have to the format the library expects usually already costs on the order of the possible gain.
Marco van de Voort