views:

575

answers:

3

I have a C array like:

char byte_array[10];

And another one that acts as a mask:

char byte_mask[10];

I would like to do get another array that is the result from the first one plus the second one using a bitwise operation, on each byte.

What's the most efficient way to do this?

thanks for your answers.

+8  A: 
for ( i = 10 ; i-- > 0 ; )
    result_array[i] = byte_array[i] & byte_mask[i];
  • Going backwards pre-loads processor cache-lines.
  • Including the decrement in the compare can save some instructions.

This will work for all arrays and processors. However, if you know your arrays are word-aligned, a faster method is to cast to a larger type and do the same calculation.

For example, let's say n=16 instead of n=10. Then this would be much faster:

uint32_t* input32 = (uint32_t*)byte_array;
uint32_t* mask32 = (uint32_t*)byte_mask;
uint32_t* result32 = (uint32_t*)result_array;
for ( i = 4 ; i-- > 0 ; )
    result32[i] = input32[i] & mask32[i];

(Of course you need a proper type for uint32_t, and if n is not a power of 2 you need to clean up the beginning and/or ending so that the 32-bit stuff is aligned.)

Variation: The question specifically calls for the results to be placed in a separate array, however it would almost certainly be faster to modify the input array in-place.

Jason Cohen
Wait, does the cache prefetcher work better in reverse? I thought it only prefetched going forwards.
Crashworks
Worrying about pre-loading processor cache-lines seems like a severe premature optimization.
Trent
@Trent -- the *point* of the question is optimization. Also going backwards is no slower, so you might as well.@Crashworks -- remember that cache lines are aligned, typically on massive boundaries, so typically it has to pull in bytes prior to the ones you're asking for.
Jason Cohen
Any statements regarding cache is going to be processor specific. I don't see where the OP states what HW this code will execute on.
Trent
@Trent -- you are correct of course, but since it doesn't hurt...
Jason Cohen
I appreciate this explanation. I will use this method. I don't fully understand the cache, so I can't really tell what's going on at that level.
alvatar
Another advantage of going backwards is that it's easier for the CPU to compare the counter to a constant 0 than to compare it with a variable. It avoids a memory access, or frees up a register, depending on if the count is stored in a register.
Adam Rosenfield
Why a power of 2? Wouldn't a multiple of a word size work? You assume 32 bit word here?
Ian Kelling
Nice answer Jason. I would add one other option for the aligned case: use vector operations if the processor supports them. Such as SSE on x86. GCC and Intel C++ both support intrinsics that make it easy to "vectorize" loops like the one above. Google "gcc sse instrinsics" for some good links.
sstock
@Ian - Yes any multiple of word size works, ALSO ASSUMING that the char arrays in question are themselves word-aligned. Also you are right that I'm assuming 32-bit processor; it must be tuned to the processor in question. Although, assuming 32-bit is still faster than byte-wise on almost any proc.
Jason Cohen
+4  A: 

If you want to make it faster, make sure that byte_array has length that is multiple of 4 (8 on 64-bit machines), and then:

char byte_array[12];
char byte_mask[12];
/* Checks for proper alignment */
assert(((unsigned int)(void *)byte_array) & 3 == 0);
assert(((unsigned int)(void *)byte_mask) & 3 == 0);
for (i = 0; i < (10+3)/4; i++) {
  ((unsigned int *)(byte_array))[i] &= ((unsigned int *)(byte_mask))[i];
}

This is much faster than doing it byte per byte.

(Note that this is in-place mutation; if you want to keep the original byte_array also, then you obviously need to store the results in another array instead.)

antti.huima
10/4 == 2, so this only processes 8 chars. In addition, on some non-x86 architectures this may raise a bus error due to unaligned memory accesses.
bk1e
bk1e: you are right, i < 10/4 is wrong. The comment about bus error is also correct. I will edit the answer.
antti.huima
If it is not a multiple of 4/8, use duff's device :)
Brian
+1  A: 

#define CHAR_ARRAY_SIZE (10) #define INT_ARRAY_SIZE ((CHAR_ARRAY_SIZE/ (sizeof (unsigned int)) + 1)

typedef union _arr_tag_ {

char          byte_array [CHAR_ARRAY_SIZE];
unsigned int  int_array [INT_ARRAY_SIZE];

} arr_tag;

Now int_array for masking. This might work for both 32bit and 64 bit processors.

arr_tag arr_src, arr_result, arr_mask;

for (int i = 0; i < INT_ARRAY_SIZE; i ++) {

arr_result.int_array [i] = arr_src.int_array[i] & arr_mask.int_array [i];

}

Try this, code might also look clean.

Alphaneo
Thanks for writing the example code :)
alvatar