for ( i = 10 ; i-- > 0 ; )
    result_array[i] = byte_array[i] & byte_mask[i];
- Going backwards may help pre-load processor cache lines on some processors.
- Folding the decrement into the loop test can save a few instructions.
This will work for all arrays and processors. However, if you know your arrays are word-aligned, a faster method is to cast to a larger type and do the same calculation.
For example, let's say n=16 instead of n=10. Then this would be much faster:
uint32_t* input32 = (uint32_t*)byte_array;
uint32_t* mask32 = (uint32_t*)byte_mask;
uint32_t* result32 = (uint32_t*)result_array;
for ( i = 4 ; i-- > 0 ; )
    result32[i] = input32[i] & mask32[i];
(Of course you need a proper type for uint32_t, for example the one from <stdint.h>, and if n is not a multiple of 4 you need to clean up the beginning and/or ending with byte-wise operations so that the 32-bit stuff is aligned.)
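The cleanup step might look something like the sketch below. The helper name `and_bytes` and its parameter names are made up for illustration and are not part of the question. Note that it swaps the pointer casts for `memcpy` on 4-byte chunks: that is a different technique from the cast shown above, chosen here because it does not depend on the arrays being word-aligned and avoids strict-aliasing problems, so only the trailing bytes need cleanup. Compilers typically turn such fixed-size `memcpy` calls into plain loads and stores, but that is worth checking on your own target.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the question): ANDs n bytes of input
 * with mask into result, four bytes at a time where possible. */
static void and_bytes(uint8_t *result, const uint8_t *input,
                      const uint8_t *mask, size_t n)
{
    size_t i = 0;

    /* Word-sized part: copy 4 bytes into locals, AND them, copy back. */
    for (; i + 4 <= n; i += 4) {
        uint32_t a, b;
        memcpy(&a, input + i, sizeof a);
        memcpy(&b, mask + i, sizeof b);
        a &= b;
        memcpy(result + i, &a, sizeof a);
    }

    /* Cleanup: the remaining 0 to 3 bytes, one at a time. */
    for (; i < n; i++)
        result[i] = input[i] & mask[i];
}
```

With n=10 this handles bytes 0 through 7 as two 32-bit words and bytes 8 and 9 individually, e.g. `and_bytes(result_array, byte_array, byte_mask, 10);`.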
Variation: The question specifically calls for the results to be placed in a separate array; however, it would almost certainly be faster to modify the input array in place.
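A minimal sketch of that in-place variation, assuming the caller no longer needs the original bytes. The function name `and_bytes_inplace` is hypothetical. It keeps the same countdown idiom as the first loop, which is safe with an unsigned index because the test `i-- > 0` compares before decrementing.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: masks the buffer in place, so no separate
 * result array is read or written. */
static void and_bytes_inplace(uint8_t *bytes, const uint8_t *mask, size_t n)
{
    /* Visits indices n-1 down to 0; does nothing when n is 0. */
    for (size_t i = n; i-- > 0; )
        bytes[i] &= mask[i];
}
```

Whether this is actually faster than writing to a separate array depends on the processor and memory system, so it is worth measuring.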