I have written a function that reads an input buffer of bytes and produces an output buffer of words: for each bit of the input, the output contains 0x0081 if the bit is ON and 0x007F if it is OFF. The length of the input buffer is given, and both arrays have enough space allocated. I also have about 2 Kbytes of free RAM which I can use for lookup tables or similar.

Now, I have found that this function is the bottleneck in a real-time application. It will be called very frequently. Can you suggest a way to optimize this function? One possibility I see would be to use only one buffer and do in-place substitution.

void inline BitsToWords(int8    *pc_BufIn, 
                        int16   *pw_BufOut, 
                        int32   BufInLen)
{
 int32 i,j,z=0;

 for(i=0; i<BufInLen; i++)
 {
  for(j=0; j<8; j++, z++)
  {
   pw_BufOut[z] = 
                    ( ((pc_BufIn[i] >> (7-j))&0x01) == 1? 
                    0x0081: 0x007f );
  }
 }
}

Please do not offer any library-specific, compiler-specific, or CPU/hardware-specific optimizations, because this is a multi-platform project.

A: 

I was going to suggest boost::for_each, since it would unroll the loop, but the end isn't known at compile time. The best you'll get, I think, is to unroll the inner loop; I'd look for ways to do that. boost::for_each over an mpl::range may be an option there.

Noah Roberts
We don't use the boost library.
psihodelia
sucks to be you.
Noah Roberts
A: 

What comes to mind immediately:

  • Unroll the inner loop (the compiler might do so already, but you can optimize further if you do it manually; see below)
  • Instead of keeping "z", keep a pointer that is incremented (the compiler might do that already)
  • Instead of performing a comparison for each unrolled item, shift the extracted bit so that it contributes 0 or 2, and add that to 0x7f. That gives you 0x7F or 0x81 for each bit without a branch (see the sketch below).
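
For illustration, a minimal sketch of that branch-free step, using the OP's loop variables (the cast to uint8 is my assumption, to sidestep sign extension):

   int16 bit = ((uint8)pc_BufIn[i] >> (7 - j)) & 0x01;  /* 0 or 1 */
   pw_BufOut[z] = 0x007f + (bit << 1);                  /* 0x007f or 0x0081 */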

Best thing is to see what sort of assembler that is generated for your target platforms, and see what the compiler is doing.

EDIT: I wouldn't use a lookup table. The cost of the extra cache misses will probably be more than the cost of the simple calculation.

EDIT2: Let me get to the other computer and fire up the compiler, and I'll see what I can do.

arke
can you elaborate on the cache miss assumption? there are only 8 entries for the table...
fabrizioM
256 entries, one for each possible byte value. If you make each entry one byte (you don't need more than that), you have 256 bytes. Depending on the platform, that might be 2-4 cache lines that are "blocked".
arke
@arke: It's actually worse, because each entry in the byte-indexed table needs to contain 8 values (one `0x7f` or `0x81` for each bit in the index). So it's either 4096 bytes or 2048 bytes, depending on whether you store `int16` or `int8` values in the table.
caf
A: 

You can hoist pc_BufIn[i] out of the inner loop. Also, at first glance, if you count backwards in the inner loop, you can skip the 7-j calculation.
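
A minimal sketch of both suggestions combined (same variables and typedefs as the question; counting j down makes the shift amount j itself):

 for(i=0; i<BufInLen; i++)
 {
  int8 tmp = pc_BufIn[i];   /* hoisted out of the inner loop */
  for(j=7; j>=0; j--, z++)
  {
   pw_BufOut[z] = ((tmp >> j) & 0x01) ? 0x0081 : 0x007f;
  }
 }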

jopa
A: 

I might suggest creating a lookup table of the 8 possible single bit masks (i.e. 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80) and then use these to compare to your bitfield in a loop. Pseudocode (bitmasks above called bitmask, in the appropriate order):

for(i=0; i<BufInLen; i++)
  for(j=0; j<8; j++, z++)
    pw_BufOut[z] = (pc_BufIn[i] & bitmask[j]) == 0 ? 0x007f : 0x0081;
VeeArr
A: 

If you don't mind keeping 256 precomputed 8-word outputs in memory, you could generate all the possible outputs up front and skip the inner loop entirely, replacing it with a copy of perm[pc_BufIn[i]] into the output (perm being an array with all 256 permutations).
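
A sketch of that approach, assuming roughly 4KB can be spared for the table (256 entries of 8 int16 each), that perm is filled in once at startup, and that memcpy from <string.h> is available:

static int16 perm[256][8];  /* perm[b] = the 8-word expansion of byte b */

void inline BitsToWords(int8 *pc_BufIn, int16 *pw_BufOut, int32 BufInLen)
{
  int32 i;
  for(i=0; i<BufInLen; i++)
    memcpy(&pw_BufOut[i*8], perm[(uint8)pc_BufIn[i]], sizeof perm[0]);
}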

kiwi
+2  A: 

I assume you can't use parallelism?


This is only a guess - you really need to be guided by a profiler - but I think lookup tables could work.

If I understand correctly, each byte in the input array produces 16 bytes (8 words) in the output. So a lookup table that gives the 16-byte output for a single input byte would take 4KiB - which is more than you have to spare.

You could split each byte into two 4-bit halves instead, which reduces the required table to 16 entries of four int16 values - 128 bytes:

static const int16 values[16][4] = {...};

void inline BitsToWords(int8    *pc_BufIn, int16   *pw_BufOut, int32   BufInLen)
{
  for(int32 i=0; i<BufInLen; ++i, pw_BufOut+=8)
  {
    /* high nibble first, to keep the most-significant-bit-first output order */
    memcpy(pw_BufOut,   values[(pc_BufIn[i]&0xF0)>>4], sizeof values[0]);
    memcpy(pw_BufOut+4, values[pc_BufIn[i]&0x0F],      sizeof values[0]);
  }
}

Also, if you're finding that the loop overhead is excessive, you could use a Duff's Device.
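
For reference, the canonical device interleaves a switch with a do-while so the same unrolled body handles the tail; a sketch for copying words, assuming count > 0:

void copy_words(int16 *to, const int16 *from, int32 count)
{
    int32 n = (count + 7) / 8;   /* number of passes through the loop */
    switch (count % 8) {         /* jump into the middle for the tail */
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}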

Joe Gauterin
You should AVOID using Duff's Device. Yes, it's a cute clever trick, but it's easy to screw up, and it isn't going to be faster than memcpy() on modern hardware, because memcpy() will be optimized to take advantage of any hardware block move/copy instructions. If there aren't any specialized instructions, then I'd expect memcpy() to use Duff's Device itself.
Craig Trader
From the limitations listed in the question, it's reasonable to assume that it will be running on an embedded platform or older hardware - in which case a Duff's Device might make sense.
Joe Gauterin
A: 

First off, since you are bit twiddling, change everything to unsigned. This eliminates any adverse effects due to sign extension or other sign related operations.
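
To see the hazard (illustrative only; whether it bites depends on how int8 is typedef'd and on the platform's shift behavior):

 int8  s = (int8)0x80;   /* if int8 is signed, s is -128                   */
 int   r = s >> 7;       /* right shift of a negative value: typically -1,
                          * and implementation-defined in any case         */
 uint8 u = (uint8)s;
 int   k = u >> 7;       /* always 1: no sign bits dragged in              */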

You could use a modified Duff's device:

void inline BitsToWords(int8    *pc_BufIn, 
                        int16   *pw_BufOut, 
                        int32   BufInLen)
{
    uint32 i,j,z=0;

    for(i=0; i<BufInLen; i++)
    {
        uint8   byte = pc_BufIn[i];
        for (j = 0; j < 2; ++j)
        {
            switch (byte & 0x0F)
            {
                case 0:     // 0000 binary
                    pw_BufOut[z++] = 0x7F;
                    pw_BufOut[z++] = 0x7F;
                    pw_BufOut[z++] = 0x7F;
                    pw_BufOut[z++] = 0x7F;
                    break;
                case 1:     // 0001 binary
                    pw_BufOut[z++] = 0x7F;
                    pw_BufOut[z++] = 0x7F;
                    pw_BufOut[z++] = 0x7F;
                    pw_BufOut[z++] = 0x81;
                    break;
                case 2:     // 0010 binary
                    pw_BufOut[z++] = 0x7F;
                    pw_BufOut[z++] = 0x7F;
                    pw_BufOut[z++] = 0x81;
                    pw_BufOut[z++] = 0x7F;
                    break;

               // And so on ...
                case 15:        // 1111 binary
                    pw_BufOut[z++] = 0x81;
                    pw_BufOut[z++] = 0x81;
                    pw_BufOut[z++] = 0x81;
                    pw_BufOut[z++] = 0x81;
                    break;
            } // End: switch
            byte >>= 1;
        }
    }
}
Thomas Matthews
That's even slower, because now instead of a simple yes/no branch in the inner loop you have a comparison plus an indexed branch. BTW, that's not a Duff's device.
arke
Actually, in the OP's loop, there are 8 comparisons for each byte, or at least 8 branches for each byte (due to the `if` statement). I reduced the inner loop to 2 branches per byte. And the `switch` statement is a jump table, so it is an indexed jump. Compile this, optimized for speed. This design sped up my embedded system at least 30% from using an `if` statement in the loop. Profile it.
Thomas Matthews
I'd agree that this is something that's probably worth trying. I'd also change from using indexes to using pointer increments - not just because I think compilers would probably have an easier time optimizing it, but I think it's more idiomatic and readable. Oh, and change the `byte >>= 1` to `byte >>= 4`.
Michael Burr
+1  A: 

On the assumption that pc_bufIn and pw_bufOut point to non-overlapping memory regions, I can think of several optimizations off the top of my head. The first is that you can declare the pointers to be non-aliased:

void inline BitsToWords(int8  * restrict pc_BufIn, 
                        int16 * restrict pw_BufOut, 
                        int32            BufInLen)

This will allow the compiler to perform optimizations that otherwise wouldn't be permitted. Note that your compiler may use a different keyword; I think some use __restrict__ or may have a compiler-specific attribute. Note that the only requirement is that pc_bufIn and pw_bufOut do not overlap. This should give you an immediate performance speedup, since the compiler will not attempt to re-read pc_bufIn whenever pw_bufOut is written out to (saving you 7 reads for every 8 writes).
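
If you need to support compilers with different spellings, a small shim macro is a common pattern (a sketch; __restrict__ and __restrict are the GCC and MSVC spellings I'm aware of):

#if defined(__GNUC__)
#  define RESTRICT __restrict__
#elif defined(_MSC_VER)
#  define RESTRICT __restrict
#else
#  define RESTRICT /* unsupported: expands to nothing */
#endif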

If that keyword is not available, an alternative optimization is possible:

{
 int8* bufInEnd = pc_BufIn + BufInLen;
 while(pc_BufIn != bufInEnd)
 {
  int8 tmp = *pc_BufIn++;
  for(int j=0; j<8; j++)
  {
   *pw_BufOut++ = ( ((tmp & (0x80 >> j)) != 0) ?
                    0x0081 : 0x007f );
  }
 }
}

The above slight rewrite is, to me, easier to follow, as it states unequivocally the path the CPU takes, but I hope the optimization is obvious: store the value of pc_BufIn[i] in a temporary local variable instead of hitting the pointer on every iteration of the inner loop.

Another, less obvious optimization would utilize the increasingly common vector hardware available on most CPUs (including ARM's NEON and Intel's SSE) to synthesize the result 16 bytes at a time. I'd recommend investigating that option.

greyfade
I overlooked the new-ish `restrict` keyword. glibc defines `__restrict` for use when `restrict` isn't available, though I think it does nothing in that case.
nategoose
If `restrict` isn't available, either it's a really old compiler (upgrade!) or there's probably an attribute or `#pragma` you can abuse.
greyfade
+2  A: 

First attempt:

void inline BitsToWords(int8    *pc_BufIn,  
                        int16   *pw_BufOut,  
                        int32   BufInLen) 
{ 
 int32 i,j;
 int8 tmp;
 static const int16 translate[2] = { 0x007f, 0x0081 };

 for(i=0; i<BufInLen; i++) 
 { 
  tmp = pc_BufIn[i];
  for(j=0x80; j!=0; j>>=1) 
  { 
   *pw_BufOut++ = translate[(tmp & j) != 0];
  } 
 } 
} 

Second attempt, stealing shamelessly from Michael Burr (who already got a +1 from me):

void inline BitsToWords(int8    *pc_BufIn,  
                        int16   *pw_BufOut,  
                        int32   BufInLen) 
{ 
    while (BufInLen--) { 
        int16 tmp = *pc_BufIn++; 

        *pw_BufOut++ = 0x007f + ((tmp >> 6) & 0x02); 
        *pw_BufOut++ = 0x007f + ((tmp >> 5) & 0x02); 
        *pw_BufOut++ = 0x007f + ((tmp >> 4) & 0x02); 
        *pw_BufOut++ = 0x007f + ((tmp >> 3) & 0x02); 
        *pw_BufOut++ = 0x007f + ((tmp >> 2) & 0x02); 
        *pw_BufOut++ = 0x007f + ((tmp >> 1) & 0x02); 
        *pw_BufOut++ = 0x007f + (tmp & 0x02); 
        *pw_BufOut++ = 0x007f + ((tmp << 1) & 0x02);  
    } 
} 
Mark Ransom
By removing the inner loop (in the same way Neil has done), I'd be interested to know how array access compares with the ternary-operator form?
James Morris
If your lookup table fits inside a cache line for your processor, then it's definitely going to be faster; if not, it's still probably going to be faster, since the ternary form is just a shortcut for an if-then-else statement.
Craig Trader
A lot depends on your processor. I know x86 has an instruction for converting !=0 to (0,1) directly; I'm not sure about other architectures. Also the result needs to be multiplied by 2 to index the array memory, again the x86 does this directly.
Mark Ransom
+6  A: 

I also have about 2 Kbytes of free RAM which I can use for lookup tables

Your lookup tables can be placed in a const array at compile time, so they could live in ROM - does that give you room for the straightforward 4KB table?

If you can afford 4KB of ROM space, the only problem is building the table as an initialized array in a .c file - but that only has to be done once, and you can write a script to do it (which may help ensure it's correct and may also help if you decide that the table needs to change for some reason in the future).
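
A sketch of such a generator, written as a one-off C program (the table name is illustrative; it prints an initialized array to stdout for pasting into a .c file):

#include <stdio.h>

int main(void)
{
    int b, j;
    puts("static const int16 bit_words[256][8] = {");
    for (b = 0; b < 256; b++) {
        printf("    { ");
        for (j = 7; j >= 0; j--)
            printf("0x%04x%s", ((b >> j) & 1) ? 0x0081 : 0x007f, j ? ", " : " ");
        printf("},\n");
    }
    puts("};");
    return 0;
}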

You'd have to profile to ensure that the copy from ROM to the destination array is actually faster than calculating what needs to go into the destination - I wouldn't be surprised if something along the lines of:

/* untested code - please forgive any bonehead errors */
void inline BitsToWords(int8    *pc_BufIn, 
                        int16   *pw_BufOut, 
                        int32   BufInLen)
{
    while (BufInLen--) {
        unsigned int tmp = *pc_BufIn++;

        *pw_BufOut++ = (tmp & 0x80) ? 0x0081 : 0x007f;
        *pw_BufOut++ = (tmp & 0x40) ? 0x0081 : 0x007f;
        *pw_BufOut++ = (tmp & 0x20) ? 0x0081 : 0x007f;
        *pw_BufOut++ = (tmp & 0x10) ? 0x0081 : 0x007f;
        *pw_BufOut++ = (tmp & 0x08) ? 0x0081 : 0x007f;
        *pw_BufOut++ = (tmp & 0x04) ? 0x0081 : 0x007f;
        *pw_BufOut++ = (tmp & 0x02) ? 0x0081 : 0x007f;
        *pw_BufOut++ = (tmp & 0x01) ? 0x0081 : 0x007f; 
    }
}

ends up being faster. I'd expect that an optimized build of that function would have everything in registers or encoded into the instructions except for a single read of each input byte and a single write of each output word. Or pretty close to that.

You might be able to further optimize by acting on more than one input byte at a time, but then you have to deal with alignment issues and how to handle input buffers that aren't a multiple of the chunk size you're dealing with. Those aren't problems that are too hard to deal with, but they do complicate things, and it's unclear what kind of improvement you might be able to expect.

Michael Burr
Your tmp must be int8 , not unsigned int.
psihodelia
+1  A: 

If you're going for raw speed, then using a lookup table (to avoid the inner loop with bit shifts) is probably the best approach.

static const int16 lookup[256][8] = {
  { 0x007f, 0x007f, 0x007f, 0x007f, 0x007f, 0x007f, 0x007f, 0x007f },
  { 0x007f, 0x007f, 0x007f, 0x007f, 0x007f, 0x007f, 0x007f, 0x0081 },
  { 0x007f, 0x007f, 0x007f, 0x007f, 0x007f, 0x007f, 0x0081, 0x007f },
  { 0x007f, 0x007f, 0x007f, 0x007f, 0x007f, 0x007f, 0x0081, 0x0081 },
  /* skip 251 entries */
  { 0x0081, 0x0081, 0x0081, 0x0081, 0x0081, 0x0081, 0x0081, 0x0081 },
};

void inline BitsToWords(int8 * input, int16 * output, int32 length) {
  while ( length-- ) {
    memcpy( output, lookup[ (uint8)*input++ ], 16 ); /* cast avoids a negative index */
    output += 8;
  }
}

The problem there is that the lookup table itself would be 4KB (256*16), which is larger than you have available. That can be worked around in one of two ways. The simplest and smallest solution would be something like this:

static const int16 lookup[16][4] = {
  { 0x007f, 0x007f, 0x007f, 0x007f },
  { 0x007f, 0x007f, 0x007f, 0x0081 },
  { 0x007f, 0x007f, 0x0081, 0x007f },
  { 0x007f, 0x007f, 0x0081, 0x0081 },
  /* skip 11 entries */
  { 0x0081, 0x0081, 0x0081, 0x0081 },
};

void inline BitsToWords(int8 * input, int16 * output, int32 length) {
  while ( length-- ) {
    uint8 c = *input++;
    /* high nibble first, to preserve most-significant-bit-first order */
    memcpy( output,   lookup[ c >> 4 ],   8 );
    memcpy( output+4, lookup[ c & 0x0f ], 8 );
    output += 8;
  }
}

The more complex, but possibly faster, way would be to use a De Bruijn sequence to encode all of the possible lookup values. This would reduce the lookup table from 4KB to 512+14 bytes, but would require an additional level of indirection and another index table (256 bytes), for a total of 782 bytes. This would remove one of the memcpy() calls, as well as the shift and the bitwise AND, at the cost of one more index lookup. Probably not necessary in your case, but interesting all the same.

Craig Trader
You can halve the size of the lookup table by using `int8` for it, but the tradeoff is that requires using 8 assignments rather than a `memcpy()`.
caf
Actually, unless the target memory is already initialized (not necessarily a valid assumption) you'd need to do 8 copies and 8 assignments (zero x+0, copy x+1, zero x+2, copy x+3, ...).
Craig Trader
A: 

Firstly, you're doing this for an 8 segment display, aren't you?

You may want to

#include <stdint.h>

It contains typedefs for sized integers with names like uint8_t and uint_fast8_t. Your types serve similar purposes to the first form, but the fast versions can be bigger if the target processor works better with data of that size. You probably won't want to change your array types, though; mostly just your local variable types.

void inline BitsToWords(int8    *pc_BufIn, 
                        int16   *pw_BufOut, 
                        int32   BufInLen)
{
  //int32 i,j,z=0;
  /* This is a place you might want to use a different type, but
   * I don't know for sure.  It depends on your processor, and I
   * didn't use these variables */

  int8 * end = pc_BufIn + BufInLen; /* So that you can do pointer math rather
                                     * than indexing. */
  while (pc_BufIn < end)
  {
    uint_fast8_t cur = (uint8)*(pc_BufIn++);
    uint_fast8_t down = 8;

    do
    {
       *(pw_BufOut++) = 0x007f + ((cur & 0x80) >> 6); /* When the top bit is set, add 2 */
       /* By doing this with addition we avoid a jump. */

       cur <<= 1; /* next most significant bit */
    } while (--down);
  }
}

In this code I changed the order of the second loop to count down rather than up. This is often more efficient if your lower limit is 0 or -1. Also, it seemed like you were going from the most significant bit to the least anyway.

Alternately you could unroll the inner loop and produce faster code and do away with the down variable. Your compiler may already be doing this for you.

*(pw_BufOut++) = 0x007f + ((cur & 0x80) >> 6);
cur <<= 1; /* next most significant bit */
*(pw_BufOut++) = 0x007f + ((cur & 0x80) >> 6);
cur <<= 1; /* next most significant bit */
*(pw_BufOut++) = 0x007f + ((cur & 0x80) >> 6);
cur <<= 1; /* next most significant bit */
*(pw_BufOut++) = 0x007f + ((cur & 0x80) >> 6);
cur <<= 1; /* next most significant bit */
*(pw_BufOut++) = 0x007f + ((cur & 0x80) >> 6);
cur <<= 1; /* next most significant bit */
*(pw_BufOut++) = 0x007f + ((cur & 0x80) >> 6);
cur <<= 1; /* next most significant bit */
*(pw_BufOut++) = 0x007f + ((cur & 0x80) >> 6);
cur <<= 1; /* next most significant bit */
*(pw_BufOut++) = 0x007f + ((cur & 0x80) >> 6);

For the outer loop, I changed it to increment a pointer rather than use array[index] with an index test as the conditional. Many processors can do pointer+offset addressing for you, and on those the pointer++ method may not be a win. In that case I would suggest reversing the outer loop and counting the index down until index < 0. Decrementing it just before the test will often set the same flags as explicitly testing the value against 0, and compilers usually take advantage of that when optimizations are turned on.

Another thing that you may want to try is to use bigger chunks than bytes as your input. You would have to worry about endian issues and non-word sized input arrays.

One more thing you may want to consider is not doing this operation for an entire variable-length buffer in one call. You could do one input byte or one word per call and pass that 8-word (16-byte) chunk of memory to whatever consumes it (a piece of hardware, I assume). Then you may be able to reduce the memory requirements of your output array, which may improve cache performance.

nategoose