ansaurus

Question

What's the most efficient way to perform bitwise operations on a C array?

Answer 1

+8 A:

for ( i = 10 ; i-- > 0 ; )
    result_array[i] = byte_array[i] & byte_mask[i];

Going backwards pre-loads processor cache-lines.
Including the decrement in the compare can save some instructions.

This will work for all arrays and processors. However, if you know your arrays are word-aligned, a faster method is to cast to a larger type and do the same calculation.

For example, let's say n=16 instead of n=10. Then this would be much faster:

uint32_t* input32 = (uint32_t*)byte_array;
uint32_t* mask32 = (uint32_t*)byte_mask;
uint32_t* result32 = (uint32_t*)result_array;
for ( i = 4 ; i-- > 0 ; )
    result32[i] = input32[i] & mask32[i];

(Of course you need a proper type for uint32_t, and if n is not a power of 2 you need to clean up the beginning and/or ending so that the 32-bit stuff is aligned.)
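For an arbitrary n, the clean-up can be sketched roughly as follows. This is only an illustration, not the code from the answer: the helper name and_bytes is made up, and it swaps the pointer casts for fixed-size memcpy calls so that it does not depend on the arrays being word-aligned. Compilers commonly turn such a memcpy into a single load or store, but that is an assumption worth checking on your target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: result[i] = input[i] & mask[i] for i in [0, n). */
static void and_bytes(unsigned char *result, const unsigned char *input,
                      const unsigned char *mask, size_t n)
{
    size_t i = 0;

    assert(result != NULL && input != NULL && mask != NULL);

    /* Word-wise pass over as many full 32-bit words as fit in n bytes. */
    for (; i + sizeof(uint32_t) <= n; i += sizeof(uint32_t)) {
        uint32_t a, b;
        memcpy(&a, input + i, sizeof a);   /* alignment-safe load */
        memcpy(&b, mask + i, sizeof b);
        a &= b;
        memcpy(result + i, &a, sizeof a);  /* alignment-safe store */
    }

    /* Byte-wise pass over the remaining 0 to 3 bytes. */
    for (; i < n; i++)
        result[i] = input[i] & mask[i];
}
```

With n=10 this handles two full words and then two trailing bytes. Passing the same pointer for result and input gives the in-place variant.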

Variation: The question specifically calls for the results to be placed in a separate array; however, it would almost certainly be faster to modify the input array in place.

Jason Cohen 2009-03-20 22:53:38

Wait, does the cache prefetcher work better in reverse? I thought it only prefetched going forwards.

Crashworks 2009-03-20 22:57:02

Worrying about pre-loading processor cache-lines seems like a severe premature optimization.

Trent 2009-03-20 22:57:46

@Trent -- the *point* of the question is optimization. Also, going backwards is no slower, so you might as well. @Crashworks -- remember that cache lines are aligned, typically on massive boundaries, so the processor typically has to pull in bytes prior to the ones you're asking for.

Jason Cohen 2009-03-20 22:58:46

Any statement regarding the cache is going to be processor-specific. I don't see where the OP states what hardware this code will execute on.

Trent 2009-03-20 22:59:54

@Trent -- you are correct of course, but since it doesn't hurt...

Jason Cohen 2009-03-20 23:00:26

I appreciate this explanation. I will use this method. I don't fully understand the cache, so I can't really tell what's going on at that level.

alvatar 2009-03-20 23:08:11

Another advantage of going backwards is that it's easier for the CPU to compare the counter to a constant 0 than to compare it with a variable. It avoids a memory access or frees up a register, depending on whether the count is stored in a register.

Adam Rosenfield 2009-03-21 01:14:50

Why a power of 2? Wouldn't a multiple of a word size work? You assume 32 bit word here?

Ian Kelling 2009-03-21 03:47:38

Nice answer, Jason. I would add one other option for the aligned case: use vector operations if the processor supports them, such as SSE on x86. GCC and Intel C++ both support intrinsics that make it easy to "vectorize" loops like the one above. Google "gcc sse intrinsics" for some good links.
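A minimal sketch of what that could look like with SSE2 intrinsics. The helper name and_bytes_simd is made up for illustration, and the __SSE2__ check assumes a GCC- or Clang-style compiler; where the macro is not defined, the code simply falls back to the plain byte loop.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Hypothetical helper: result[i] = input[i] & mask[i] for i in [0, n). */
static void and_bytes_simd(unsigned char *result, const unsigned char *input,
                           const unsigned char *mask, size_t n)
{
    size_t i = 0;

    assert(result != NULL && input != NULL && mask != NULL);

#if defined(__SSE2__)
    /* 16 bytes per iteration; the "u" variants tolerate unaligned pointers. */
    for (; i + 16 <= n; i += 16) {
        __m128i a = _mm_loadu_si128((const __m128i *)(input + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(mask + i));
        _mm_storeu_si128((__m128i *)(result + i), _mm_and_si128(a, b));
    }
#endif

    /* Scalar loop handles the tail, or everything when SSE2 is unavailable. */
    for (; i < n; i++)
        result[i] = input[i] & mask[i];
}
```

For a 10-byte array the vector loop never runs, so this only pays off on longer arrays. If the arrays are known to be 16-byte aligned, the aligned _mm_load_si128 and _mm_store_si128 variants can be used instead.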

sstock 2009-03-21 05:57:06

@Ian - Yes any multiple of word size works, ALSO ASSUMING that the char arrays in question are themselves word-aligned. Also you are right that I'm assuming 32-bit processor; it must be tuned to the processor in question. Although, assuming 32-bit is still faster than byte-wise on almost any proc.

Jason Cohen 2009-03-21 19:05:55

Answer 2

+4 A:

If you want to make it faster, make sure that byte_array has a length that is a multiple of 4 (8 on 64-bit machines), and then:

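/* Needs <assert.h> and <stdint.h> */
char byte_array[12];
char byte_mask[12];
/* Checks for proper alignment. The extra parentheses matter:
   == binds tighter than &, so "x & 3 == 0" is always 0. */
assert((((uintptr_t)(void *)byte_array) & 3) == 0);
assert((((uintptr_t)(void *)byte_mask) & 3) == 0);
/* (10+3)/4 rounds up to 3 words, covering all 12 bytes of the padded arrays */
for (i = 0; i < (10+3)/4; i++) {
  ((unsigned int *)(byte_array))[i] &= ((unsigned int *)(byte_mask))[i];
}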

This is much faster than doing it byte per byte.

(Note that this is in-place mutation; if you want to keep the original byte_array also, then you obviously need to store the results in another array instead.)

antti.huima 2009-03-20 22:55:58

10/4 == 2, so this only processes 8 chars. In addition, on some non-x86 architectures this may raise a bus error due to unaligned memory accesses.
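To spell out the arithmetic behind the first point, here is a small sketch (the function names are made up for illustration): integer division rounds down, so the word count has to be rounded up, and the arrays have to be padded to the rounded-up size.

```c
#include <assert.h>
#include <stddef.h>

/* Number of complete 4-byte words in n bytes (rounds down). */
static size_t full_words(size_t n) { return n / 4; }

/* Number of 4-byte words needed to cover n bytes (rounds up). */
static size_t covering_words(size_t n) { return (n + 3) / 4; }

/* Buffer size in bytes once n is padded up to a whole number of words. */
static size_t padded_size(size_t n)
{
    size_t p = covering_words(n) * 4;
    assert(p >= n);  /* guards against wrap-around for huge n */
    return p;
}
```

For n = 10, full_words gives 2 (only 8 bytes processed), covering_words gives 3, and padded_size gives 12.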

bk1e 2009-03-24 15:30:56

bk1e: you are right, i < 10/4 is wrong. The comment about bus error is also correct. I will edit the answer.

antti.huima 2009-03-25 11:56:00

If it is not a multiple of 4/8, use Duff's device :)

Brian 2009-06-02 02:39:44

Answer 3

+1 A:

#define CHAR_ARRAY_SIZE (10)
/* Rounded up so that int_array covers every byte of byte_array */
#define INT_ARRAY_SIZE ((CHAR_ARRAY_SIZE / sizeof (unsigned int)) + 1)

typedef union _arr_tag_ {
    char          byte_array [CHAR_ARRAY_SIZE];
    unsigned int  int_array [INT_ARRAY_SIZE];
} arr_tag;

Now use int_array for masking. This should work on both 32-bit and 64-bit processors.

arr_tag arr_src, arr_result, arr_mask;

for (int i = 0; i < INT_ARRAY_SIZE; i++) {
    arr_result.int_array[i] = arr_src.int_array[i] & arr_mask.int_array[i];
}

Try this; the code might also look cleaner.

Alphaneo 2009-03-21 01:05:08

Thanks for writing the example code :)

alvatar 2009-03-21 02:45:09
