views:

565

answers:

5

I have a couple uint8_t arrays in my c code, and I'd like to compare an arbitrary sequence bits from one with another. So for example, I have bitarray_1 and bitarray_2, and I'd like to compare bits 13 - 47 from bitarray_1 with bits 5-39 of bitarray_2. What is the most efficient way to do this?

Currently it's a huge bottleneck in my program, since I just have a naive implementation that copies the bits into the beginning of a new temporary array, and then uses memcmp on them.

A: 

What about writing the function that will calculate the offsets from both arrays, apply the mask, shift the bits and store the result to the int so you may compare them. If the bits count (34 in your example) exceeds the length of the int - recurse or loop.

Sorry, the example will be pain in the ass.

Andrejs Cainikovs
+2  A: 

bits 13 - 47 of bitarray_1 are the same as bits 5 - 39 of bitarray_1 + 1.
Compare the first 3 bits (5 - 7) with a mask and the other bits (8 - 39) with memcmp().

Rather than shift and copy the bits, maybe representing them differently is faster. You have to measure.

/* code skeleton */
static char bitarray_1_bis[BIT_ARRAY_SIZE*8+1];
static char bitarray_2_bis[BIT_ARRAY_SIZE*8+1];
static const char *lookup_table[] = {
    "00000000", "00000001", "00000010" /* ... */
    /* 256 strings */
    /* ... */ "11111111"
};

/* copy every bit of bitarray_1 to an element of bitarray_1_bis */
for (k = 0; k < BIT_ARRAY_SIZE; k++) {
    strcpy(bitarray_1_bis + 8*k, lookup_table[bitarray_1[k]]);
    strcpy(bitarray_2_bis + 8*k, lookup_table[bitarray_2[k]]);
}
memcmp(bitarray_1_bis + 13, bitarray_2_bis + 5, 47 - 13 + 1);

You can (and should) limit the copy to the minimum possible.

I have no idea if it's faster, but it wouldn't surprise me if it was. Again, you have to measure.

pmg
Actually I doubt that bit manipulations will be slower, but your answer is good. Except that lookup_table could be created by malloc and a loop. +1
Andrejs Cainikovs
I looked at the code example and hastily downvoted. Sorry. Now that I look at your first paragraph again, that makes sense, so I undid that vote.I still don't think it's a good idea to create a >64KB lookup table for this exercise :-) But I think the first paragraph is the way to go. +1
asveikau
+1  A: 

The easiest way to do this is to convert the more complex case into a simpler case, then solve the simpler case.

In the following code, do_compare() solves the simpler case (where the sequences are never offset by more than 7 bits, s1 is always offset as much or more than s2, and the length of the sequence is non-zero). The compare_bit_sequence() function then takes care of converting the harder case to the easier case, and calls do_compare() to do the work.

This just does a single-pass through the bit sequences, so hopefully that's an improvement on your copy-and-memcmp implementation.

#define NOT_EQUAL 0
#define EQUAL 1

/* do_compare()
 *
 * Does the actual comparison, but has some preconditions on parameters to
 * simplify things:
 *
 *     length > 0
 *     8 > s1_off >= s2_off
 */
int do_compare(const uint8_t s1[], const unsigned s1_off, const uint8_t s2[],
    const unsigned s2_off, const unsigned length)
{
    const uint8_t mask_lo_bits[] =
        { 0xff, 0x01, 0x03, 0x07, 0x0f, 0x1f, 0x3f, 0x7f, 0xff };
    const uint8_t mask_hi_bits[] =
        { 0x00, 0x80, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc, 0xfe, 0xff };
    const unsigned msb = (length + s1_off - 1) / 8;
    const unsigned s2_shl = s1_off - s2_off;
    const unsigned s2_shr = 8 - s2_shl;
    unsigned n;
    uint8_t s1s2_diff, lo_bits = 0;

    for (n = 0; n <= msb; n++)
    {
        /* Shift s2 so it is aligned with s1, pulling in low bits from
         * the high bits of the previous byte, and store in s1s2_diff */
        s1s2_diff = lo_bits | (s2[n] << s2_shl);

        /* Save the bits needed to fill in the low-order bits of the next
         * byte.  HERE BE DRAGONS - since s2_shr can be 8, this below line
         * only works because uint8_t is promoted to int, and we know that
         * the width of int is guaranteed to be >= 16.  If you change this
         * routine to work with a wider type than uint8_t, you will need
         * to special-case this line so that if s2_shr is the width of the
         * type, you get lo_bits = 0.  Don't say you weren't warned. */  
        lo_bits = s2[n] >> s2_shr;

        /* XOR with s1[n] to determine bits that differ between s1 and s2 */
        s1s2_diff ^= s1[n];

        /* Look only at differences in the high bits in the first byte */
        if (n == 0)
            s1s2_diff &= mask_hi_bits[8 - s1_off];

        /* Look only at differences in the low bits of the last byte */
        if (n == msb)
            s1s2_diff &= mask_lo_bits[(length + s1_off) % 8];

        if (s1s2_diff)
            return NOT_EQUAL;
    }

    return EQUAL;
}

/* compare_bit_sequence()
 *
 * Adjusts the parameters to match the preconditions for do_compare(), then
 *  calls it to do the work.
 */
int compare_bit_sequence(const uint8_t s1[], unsigned s1_off,
    const uint8_t s2[], unsigned s2_off, unsigned length)
{
    /* Handle length zero */
    if (length == 0)
        return EQUAL;

    /* Makes sure the offsets are less than 8 bits */
    s1 += s1_off / 8;
    s1_off %= 8;

    s2 += s2_off / 8;
    s2_off %= 8;

    /* Make sure s2 is the sequence with the shorter offset */
    if (s1_off >= s2_off)
        return do_compare(s1, s1_off, s2, s2_off, length);
    else
        return do_compare(s2, s2_off, s1, s1_off, length);
}

To do the comparison in your example, you'd call:

compare_bit_sequence(bitarray_1, 13, bitarray_2, 5, 35)

(Note that I am numbering the bits from zero, and assuming that the bitarrays are laid out little-endian, so this will start the comparison from the sixth-least-significant bit in bitarray2[0], and the sixth-least-signifcant bit in bitarray1[1]).

caf
ba1[0] = 1; ba2[0] = 3; compare_bit_sequence(ba1, 0, ba2, 1, 7); Seems not to work. do_compare could be probably be simplified again with some more pre-condition such that s1 offset is always 0 and length is always a multiple of 8 (that would be treating boundaries outside loop).
kriss
oups, I meant s2 offset always 0
kriss
kriss: That example works for me (returns EQUAL). There was a bug that I corrected about half an hour ago, where I had `s1_off` and `s2_off` transposed in the calculation of `s2_shl`. You can't change the offsets without altering the source values, which would require writing a copy of the sequence.
caf
@caf: OK, you corrected the bug.You can change the offset without changing the source value provided you change both and length, not source value and boundaries become special cases). Will post some code to show how.
kriss
kriss: Ah, I see what you're driving at - my loop already effectively does that (notice that `s1` isn't ever shifted, and within the loop the `s1_off` value is only used for the LSB and MSB special-cases.
caf
@caf: yes it's nearly done, you just have to take the two if (n == xxx) out of the body of the main loop and reduce loop size. Cost of such unecessary tests inside loop could be quite high for large bit arrays. I also posted my version after my answer.
kriss
I considered that, but in the end I found that handling the case where n = 0 = MSB and s1_off > 0 was easiest putting the tests within the loop. You could remove the `(n == 0)` test by unrolling the first iteration of the loop, and I would hope that a good optimising compiler would minimise the cost of the `(n == msb)` test by combining it with the `n <= msb` test for the for loop.
caf
@caf: both are easily removed unrolling the loop. Don't see how the compiler could do anything : one test two places to go :-(also you comment about dragons is quite surprising. I believe the real reason it's working is because we have an unsigned type. Try using a char instead of a uint8_t and you'll got -1 instead of zero when shifting right by 8 (at least it's what gcc does, didn't checked if this behavior is defined in C99 or C++ standards).
kriss
The messy bit with unrolling them both is when you need to apply the special-case behaviour for the last byte to the first byte (when the stream fits within a single byte) - see my latest comment on yours for a test case that exposes this. Also, unsigned types are required for most of these shifts, but the specific issue I was highlighting with that comment is that a shift is only defined if it is by a value less than the number of bits in the type being shifted - shifting a 16 bit value by 16 (or more) gives an unspecified result.
caf
@caf: never saw that shift was undefined when it is shifting by the number of bits of the type. Where have you seen that ? It's something very common and I don't see the rationale why there should be a restriction here. Don't know when we are shifting by more bits than the type. For the signed/unsigned part, I believe most compilers perform and arithmetic shift right for signed type (duplicate bit sign) and a logical shift for unsigned (insert a zero), but I don't know if it's a normalized behavior or just a common one.
kriss
caf
kriss
+2  A: 

three words: shift, mask and xor.

shift to get the same memory alignment for both bitarray. If not you will have to shift one of the arrays before comparing them. Your exemple is probably misleading because bits 13-47 and 5-39 have the same memory alignment on 8 bits addresses. This wouldn't be true if you were comparing say bits 14-48 with bits 5-39.

Once everything is aligned and exceeding bits cleared for table boundaries a xor is enough to perform the comparison of all the bits at once. Basically you can manage to do it with just one memory read for each array, which should be pretty efficient.

If memory alignment is the same for both arrays as in your example memcmp and special case for upper and lower bound is probably yet faster.

Also accessing array by uint32_t (or uint64_t on 64 bits architectures) should also be more efficient than accessing by uint8_t.

The principle is simple but as Andrejs said the implementation is not painless...

Here is how it goes (similarities with @caf proposal is no coincidence):

/* compare_bit_sequence() */
int compare_bit_sequence(uint8_t s1[], unsigned s1_off, uint8_t s2[], unsigned s2_off,
    unsigned length)
{
const uint8_t mask_lo_bits[] =
    { 0x00, 0x01, 0x03, 0x07, 0x0f, 0x1f, 0x3f, 0x7f, 0xff };
const uint8_t clear_lo_bits[] =
    { 0xff, 0xfe, 0xfc, 0xf8, 0xf0, 0xe0, 0xc0, 0x80, 0x00 };
uint8_t v1;
uint8_t * max_s1;
unsigned end;
uint8_t lsl;
uint8_t v1_mask;
int delta;

/* Makes sure the offsets are less than 8 bits */
s1 += s1_off >> 3;
s1_off &= 7;

s2 += s2_off >> 3;
s2_off &= 7;

/* Make sure s2 is the sequence with the shorter offset */
if (s2_off > s1_off){
    uint8_t * tmp_s;
    unsigned tmp_off;
    tmp_s = s2; s2 = s1; s1 = tmp_s;
    tmp_off = s2_off; s2_off = s1_off; s1_off = tmp_off;
}
delta = s1_off;

/* handle the beginning, s2 incomplete */ 
if (s2_off > 0){
    delta = s1_off - s2_off;
    v1 = delta
       ? (s1[0] >> delta | s1[1] << (8 - delta)) & clear_lo_bits[delta]
       : s1[0];
       if (length <= 8 - s2_off){
           if ((v1 ^ *s2)
                & clear_lo_bits[s2_off]
                & mask_lo_bits[s2_off + length]){
                return NOT_EQUAL;
           }
           else {
               return EQUAL;
           }
        }
        else{
            if ((v1 ^ *s2) & clear_lo_bits[s2_off]){
                return NOT_EQUAL;
        }
        length -= 8 - s2_off;
    }
    s1++;
    s2++;
}

/* main loop, we test one group of 8 bits of v2 at each loop */
max_s1 = s1 + (length >> 3);
lsl = 8 - delta;
v1_mask = clear_lo_bits[delta];
while (s1 < max_s1)
{
    if ((*s1 >> delta | (*++s1 << lsl & v1_mask)) ^ *s2++)
    {
        return NOT_EQUAL;
    }
}

/* last group of bits v2 incomplete */
end = length & 7;
if (end && ((*s2 ^ *s1 >> delta) & mask_lo_bits[end]))
{
    return NOT_EQUAL;
}

return EQUAL;

}

All possible optimisations are not yet used. One promising one would be to use larger chunks of data (64 bits or 32 bits at once instead of 8), you could also detect cases where offset are synchronised for both arrays and in such cases use a memcmp instead of the main loop, replace modulos % 8 by logical operators & 7, replace '/ 8' by '>> 3', etc., have to branches of code instead of swapping s1 and s2, etc, but the main purpose is achieved : only one memory read and not memory write for each array item hence most of the work can take place inside processor registers.

kriss
Now yours is buggy - it fails the same test case that my original one did ;)
caf
@caf: corrected :-) And I have no excuse because I *have* unit tests (that's how I saw the bug in your first version)... just didn't launched them before posting :-(
kriss
Found another bug: `char z1 = 0x82, z2 = 0x06;`, `compare_bit_sequence(` returns `NOT_EQUAL`, should be `EQUAL`.
caf
indeed. Also saw it and fixed it : the case where we are both at beginning and end was not managed in my loop unrolling (and this one was not yet in unit tests). I did some code cleanup too (mostly micro optimizations).
kriss
Hmm, and now I think you can see why I didn't go with the unrolling - it's significantly more complicated. I don't agree with replacing the divides and modulos by shifts and masks, because that's *definitely* something that the compiler will do (GCC does that even with optimisation level zero!).
caf
yes, it's quite complicated and I don't see any obvious simplification. For replacing divide and modulus by masks and shifts, it's not really for optimization. I have an assembly language background and it's easier to read for me. What is fun with that kind of programs where you move bits around is it would probably be simpler to write and easier to read written in assembly language.
kriss
A: 
sambowry