views:

355

answers:

7

I need to rewrite about 4KB of data in reverse order, at bit level (last bit of last byte becoming first bit of first byte), as fast as possible. Are there any clever sniplets to do it?

Rationale: The data is display contents of LCD screen in an embedded device that is usually positioned in a way that the screen is on your shoulders level. The screen has "6 o'clock" orientation, that is to be viewed from below - like lying flat or hanging above your eyes level. This is fixable by rotating the screen 180 degrees, but then I need to reverse the screen data (generated by library), which is 1 bit = 1 pixel, starting with upper left of the screen. The CPU isn't very powerful, and the device has enough work already, plus several frames a second would be desirable so performance is an issue; RAM not so much.

edit: Single core, ARM 9 series. 64MB, (to be scaled down to 32MB later), Linux. The data is pushed from system memory to the LCD driver over 8-bit IO port.

The CPU is 32bit and performs much better at this word size than at byte level.

+3  A: 

Loop through the half of the array, convert and exchange bytes.

for( int i = 0; i < arraySize / 2; i++ ) {
    char inverted1 = invert( array[i] );
    char inverted2 = invert( array[arraySize - i - 1] );
    array[i] = inverted2;
    array[arraySize - i - 1] = inverted1;
}

For conversion use a precomputed table - an array of 2CHAR_BIT (CHAR_BIT will most likely be 8) elements where at position "I" the result of byte with value "I" inversion is stored. This will be very fast - one pass - and consume only 2CHAR_BIT for the table.

sharptooth
+7  A: 

The Bit Twiddling Hacks site is alwas a good starting point for these kind of problems. Take a look here for fast bit reversal. Then its up to you to apply it to each byte/word of your memory block.

EDIT:

Inspired by Dietrich Epps answer and looking at the ARM instruction set, there is a RBIT opcode that reverses the bits contained in a register. So if performance is critical, you might consider using some assembly code.

Frank Bollack
The URL for Bit Twiddling hacks is broken - it should be: http://graphics.stanford.edu/~seander/bithacks.html
Paul R
Thanks for the hint, fixed it...
Frank Bollack
+1  A: 

Single Core?

How much memory?

Is the display buffered in memory and pushed to the device, or is the only copy of the pixels in the screens memory?

Spence
Single core, ARM 9 series. 64MB, (to be scaled down to 32MB later), Linux. The memory is pushed from system memory (which I control) to the LCD driver memory (outside of my control).
SF.
+12  A: 

Fastest way would probably to store the reverse of all possible byte values in a look-up table. The table would take only 256 bytes.

Maurits Rijk
...then just change the loop that writes the bytes out to the IO port to start at the end of the buffer and work backwards.
caf
+2  A: 

To reverse a single byte x you can handle the bits one at a time:

unsigned char a = 0;
for (i = 0; i < 8; ++i) {
   a += (unsigned char)(((x >> i) & 1) << (7 - i));
}

You can create a cache of these results in an array so that you can quickly reverse a byte just by making a single lookup instead of looping.

Then you just have to reverse the byte array, and when you write the data apply the above mapping. Reversing a byte array is a well documented problem, e.g. here.

Mark Byers
+10  A: 

Build a 256 element lookup table of byte values that are bit-reversed from their index.

{0x00, 0x80, 0x40, 0xc0, etc}

Then iterate through your array copying using each byte as an index into your lookup table.

If you are writing assembly language, the x86 instruction set has an XLAT instruction that does just this sort of lookup. Although it may not actually be faster than C code on modern processors.

You can do this in place if you iterate from both ends towards the middle. Because of cache effects, you may find it's faster to swap in 16 byte chunks (assuming a 16 byte cache line).

Here's the basic code (not including the cache line optimization)

// bit reversing lookup table
typedef unsigned char BYTE;
extern const BYTE g_RevBits[256];

void ReverseBitsInPlace(BYTE * pb, int cb)
{
    int iter = cb/2;
    for (int ii = 0, jj = cb-1; ii < iter; ++ii, --jj)
    {
        BYTE b1 = g_RevBits[pb[ii]];
        pb[ii] = g_RevBits[pb[jj]];
        pb[jj] = b1;
    }

    if (cb & 1) // if the number of bytes was odd, swap the middle one in place
    {
       pb[cb/2] = g_RevBits[pb[cb/2]];
    }
}

// initialize the bit reversing lookup table using macros to make it less typing.
#define BITLINE(n) \
   0x0##n, 0x8##n, 0x4##n, 0xC##n, 0x2##n, 0xA##n, 0x6##n, 0xE##n,\
   0x1##n, 0x9##n, 0x5##n, 0xD##n, 0x3##n, 0xB##n, 0x7##n, 0xF##n,

const BYTE g_RevBits[256] = {
  BITLINE(0), BITLINE(8), BITLINE(4), BITLINE(C), 
  BITLINE(2), BITLINE(A), BITLINE(6), BITLINE(E), 
  BITLINE(1), BITLINE(9), BITLINE(5), BITLINE(D), 
  BITLINE(3), BITLINE(B), BITLINE(7), BITLINE(F), 
  };
John Knoeller
+16  A: 

There's a classic way to do this. Let's say unsigned int is your 32-bit word. I'm using C99 because the restrict keyword lets the compiler perform extra optimizations in this speed-critical code that would otherwise be unavailable. These keywords inform the compiler that "src" and "dest" do not overlap. This also assumes you are copying an integral number of words, if you're not, then this is just a start.

I also don't know which bit shifting / rotation primitives are fast on the ARM and which are slow. This is something to consider. If you need more speed, consider disassembling the output from the C compiler and going from there. If using GCC, try O2, O3, and Os to see which one is fastest. You might reduce stalls in the pipeline by doing two words at the same time.

This uses 23 operations per word, not counting load and store. However, these 23 operations are all very fast and none of them access memory. I don't know if a lookup table would be faster or not.

void
copy_rev(unsigned int *restrict dest,
         unsigned int const *restrict src,
         unsigned int n)
{
    unsigned int i, x;
    for (i = 0; i < n; ++i) {
        x = src[i];
        x = (x >> 16) | (x << 16);
        x = ((x >> 8) & 0x00ff00ffU) | ((x & 0x00ff00ffU) << 8);
        x = ((x >> 4) & 0x0f0f0f0fU) | ((x & 0x0f0f0f0fU) << 4);
        x = ((x >> 2) & 0x33333333U) | ((x & 0x33333333U) << 2);
        x = ((x >> 1) & 0x55555555U) | ((x & 0x555555555) << 1);
        dest[n-1-i] = x;
    }
}

This page is a great reference: http://graphics.stanford.edu/~seander/bithacks.html#BitReverseObvious

Final note: Looking at the ARM assembly reference, there is a "REV" opcode which reverses the byte order in a word. This would shave 7 operations per loop off the above code.

Dietrich Epp