tags:

views:

46

answers:

1

While reading a post on StackOverflow (http://stackoverflow.com/questions/1502081/im-trying-to-optimize-this-c-code-using-4-way-loop-unrolling), which is now marked as closed, I came across an answer (comment, in fact) that said the following: "The two inner loops could possibly get a speed boost by using UInt64 and bit shifting"

Here is the code that was int he post:

char rotate8_descr[] = "rotate8: rotate with 8x8 blocking";

    void rotate8(int dim, pixel *src, pixel *dst) 
    {

    int i, j, ii, jj;

    for(ii = 0; ii < dim; ii += 8)
           for(jj = 0; jj < dim; jj += 8)
                  for (i = ii; i < ii + 8; i++)   
                      for (j = jj; j < jj + 8; j++)
                          dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
    }

Could anyone please explain how would that be applied here? I am interested in knowing how to apply bitshifting on this code, or a similar code, and why that would help in performance. Also, how would this code be optimized for cache usage? Any suggestions?

Assume this code was Double Tiled/Blocked (big tile=32, and inside it tiles of 16), and also Loop Invariant Code Motion was applied.. would it still benefit from bitshifting and UInt64?

If not, then what other suggestions would work?

Thanks!

+1  A: 

If the pixels were smaller, you could use 8 Uint64 registers (they are big and there are plenty of them) to cumulate there the result for rotated matrix.

Example for sizeof(pixel) == 1 and little endian machine:

for (int y = 0; y < 8; ++y){
 // for every line, we get 8 pixels from row y into src0.
 // they should go in the last colomn of the result
 // so after 8 iterations they'll get exactly in the 8ht byte 
  Uint64 src0 = *(Uint64*)(src + dim * y);
  dst0 = (dst0 << 8) | ( src0 & 0xff); // this was pixel src[y][0]
  dst1 = (dst1 << 8) | ((src0 >> 8) & 0xff); // and pixel src[y][1]
  etc...
};
// now the 8 dst0..dst7 registers contain rows 0..7 of the result. 
// putting them there
*(Uint64*)(dst) = dst0;
*(Uint64*)(dst + dim) = dst1;
etc..

The good part is that it's easier to unroll and reorder, and there are fewer memory accesses.

ruslik
so you mean with the current size of the "pixel" I can't use this?
johnshaddad
Sure you can, but the benefit could be greater on small pixels. Anyway, if you'll help the compiler to make the memory access in 64bit chunks only on aligned addresses, it will be great. It would be quite inefficient to let it work on unaligned 6 byte structures.
ruslik
Well I kind of get the concept. But could you please elaborate more on how this could be applied in my case? I am kind of lost after the second line.. could you continue the code till the end so that I test and understand the full picture?
johnshaddad
I understand that this reduces memory reading by flushing once, but I want to make sure I understand the concept correctly to absorb it.. Could you apply it on the most inner loop as it should be with the size of "pixel" that I have?
johnshaddad
When you say: dst0 = (dst0 << 8) | ( src0 why are you shifting a newly created Uint64 variable 8 bits to the left, while it is already empty?
johnshaddad
We need the shift in the next 7 iterations, and by making it in the first one we can skip initialization of the dst registers.
ruslik
I see... ok could you modify the post to have all the 8 lines (instead of the etc..)? I just want to make sure I understand it correctly
johnshaddad
Understood.. thanks! I will try to implement and let you know if I have any further questions
johnshaddad
Ok, so you said this could be used for the size of PIXEL of say 6 or 8 bytes. Could you explain how? Because the largest variable possible as I see is uint64, and this fits only 1 PIXEL structure, opposed to 8 pixels I have per row (if we assume size of TILE to be 8). How would I apply this bit-shifting trick on my case then? There is no way I can store more than one pixel in a uint64 variable, and thus I will end up doing 8 memory reads to get the 8 pixels.. plz explain thnx
johnshaddad