tags:
views: 152
answers: 5

I would like to use binary flags to represent a mathematical set in C, where "Bit i is set" means "item i is in the Set". This is convenient because operations like "union" and "intersection" are trivial to implement ("|" and "&"). However, I want my set to be able to contain more than 32 items. Furthermore, I want my code to work on both 32 and 64 bit machines.

Is there any simple way to manipulate more than one word worth of bits in C? Is there a better way to approach this task?

+3  A: 

The GNU Multiple Precision (GMP) library provides an integer implementation that is very well optimised for integers of arbitrary precision, and it also has most of the useful bit-twiddling functionality. (link)

Depending on the specific operations you actually need to perform, there may be some fancy data structures that do the job a little better. For instance, there's the very clever disjoint-set (union-find) structure, for modeling a collection of disjoint sets, which has truly astounding asymptotic performance for the three operations it supports.

TokenMacGuy
Population count and even Hamming distance, ooooh!
Potatoswatter
population count is particularly useful here because it is equivalent to set cardinality, which is a nice thing to know sometimes.
TokenMacGuy
GMP is probably okay but I will _never_ recommend it for production work since I found out that it will simply exit if it runs out of memory. In my opinion, that's a cruddy design choice for a library to use.
paxdiablo
@pax, the default memory functions exit after printing to standard error, but the library makes it possible to use [custom allocators](http://gmplib.org/manual/Custom-Allocation.html) that behave differently (e.g. custom logging), though they still can't return normally.
Matthew Flaschen
@Matthew, I've had my long discussions with the GMP developers, no need to rehash them here. We ended up agreeing to disagree :-) since they didn't see it as a problem.
paxdiablo
I don't know the situation on windows, but on linux, running out of memory is basically scarytown. You don't ever know when you've actually run out of memory, since linux will happily overcommit. You don't know what state your program might be in at all, really.
TokenMacGuy
@TokenMacGuy: That's a standard argument and it's incorrect. Linux, like any half-decent operating system, makes it possible to disable overcommit, and anyone with even modest robustness requirements will do so. Also, on 32-bit machines, running out of *virtual* address space is a lot more likely than running out of *physical* storage space (which might include several gb of ram and several times that much in swap).
R..
+4  A: 

Yes, you simply define an array of your 32-bit integers. Then you manipulate a specific element of the array.

Given a bit ID from 0 through 255 inclusive (for example), that would be an array:

unsigned int bits[8];

In order to find which element to operate on:

unsigned int index = bitId >> 5; // turns 0..255 into 0..7

To get the masks for a given bit ID:

unsigned int masks[] = {
    0x00000001, 0x00000002, 0x00000004, 0x00000008,
    0x00000010, 0x00000020, 0x00000040, 0x00000080,
    0x00000100, 0x00000200, 0x00000400, 0x00000800,
    0x00001000, 0x00002000, 0x00004000, 0x00008000,
    0x00010000, 0x00020000, 0x00040000, 0x00080000,
    0x00100000, 0x00200000, 0x00400000, 0x00800000,
    0x01000000, 0x02000000, 0x04000000, 0x08000000,
    0x10000000, 0x20000000, 0x40000000, 0x80000000
};
unsigned int mask = masks[bitId & 0x1f];

If you have the uint32_t type available in your implementation, that's probably the safest way to go. Otherwise, there are known methods of combining CHAR_BIT and sizeof to figure out at runtime how big to make the masks array, and what values to use for discovering the array index and bitmask index.

For example, this snippet from my code library shows how I did it for a character-based bitmask:

static unsigned char bitmask[CHAR_BIT];

void bitsetInit (void) {
    unsigned char mask = 1;
    int i = 0;
    while (i < CHAR_BIT) {
        bitmask[i++] = mask;
        mask <<= 1;
    }
}

and using:

bsp->bits[bitnum/CHAR_BIT] &= ~bitmask[bitnum%CHAR_BIT];
bsp->bits[bitnum/CHAR_BIT] |= bitmask[bitnum%CHAR_BIT];

for clearing and setting bits respectively.

If you wanted to use unsigned int instead of unsigned char you would simply calculate the number of bits for that:

unsigned int UINT_BIT = CHAR_BIT * sizeof (unsigned int);

and use it where I've used CHAR_BIT above (the mask array can be dynamically allocated at runtime if need be).

paxdiablo
I think `uint32_t` would be a much safer type to use here.
R..
If it's available, yes. The original question indicated 32 bits, which has now changed. So I'll update the answer ...
paxdiablo
Your definition of `UINT_BIT` is incorrect. `unsigned int` may have padding bits. Just used a fixed-width type, or use `unsigned char` and `CHAR_BIT` which is always the safest. (BTW the only systems that will lack fixed-width types are exactly the ones where the behavior of types like `unsigned int` will probably involve nasty stuff like padding.) Actually `8` might be better than `CHAR_BIT`, so if `CHAR_BIT` is something bogus like 9, you don't get stuck dividing by 9 instead of right-shifting. :-)
R..
> Given a bit ID from 0 through 255 inclusive (for example), that would be an array: `unsigned int bits[8];`

But that's space for at least 512 bits.
domen
Why do you complicate with `masks`/`bitmask` array? just use `1<<(bitnum%CHAR_BIT)`
domen
@domen, 8 32-bit (as I stated is the data type in the preceding paragraph) unsigned integers will only give you 256 bits, not 512 (in fact uints are only guaranteed to be 16 bits), and the use of masks is just another way of doing it. It also allows you to optimise if you precalc both masks (set and clear) at init time rather than bitshift and bitshift/invert at runtime. Whether it's faster on your specific platform to do a table lookup rather than bitshift/invert is something you should test. For my particular case (when I wrote that code), the lookup would have been faster.
paxdiablo
Oh, sorry, totally miscalculated 8*32 :$ Optimize how exactly? When is a bitshift slower than a memory lookup?
domen
That code is from HPUX9 (PA-RISC). I'm pretty certain I would have tested the speeds (the project was a high-performance one).
paxdiablo
+1  A: 

You could use uint64_t from <stdint.h>. Beyond that, I'm afraid you're out of luck as far as & and | are concerned, and should look for a different design (e.g. structs with appropriate functions to handle them, or third-party libraries).

DevSolar
+1  A: 

paxdiablo seems to have given you the correct approach to solve this problem the way you've said you want to solve it.

Is there a better way to approach this task?

Unless you have a specific performance or hardware reason to do your work at the bit level, there might be better ways to represent a set. For example, a linked list or a binary tree whose values are the members of the set. Both of these structures can grow (effectively) without bound, and are easy to iterate through.

Just because some set operations are easy to implement with boolean logic does not mean all are. The additional code that depends on your set operations will likely be more clear if you have a set-type interface, rather than a boolean logic (only) interface.

No matter what solution you come up with, I recommend you hide it behind an interface, so you can change your storage solution in the future. You can do this by defining functions which you pass your structure to, and only operating on the structure through those functions.

Merlyn Morgan-Graham
I disagree with your answer. Up to minor tweaks, there's only one possible implementation for a bit array. Any other data structure will have much worse performance, both in terms of big-O and plain practical overhead, and there's no sense wrapping your bit array operations and making them slower for the sake of possibly being able to swap in an even slower implementation in the future.
R..
@R: I'm talking about mathematical sets, not bit arrays. If you're just talking bit arrays, then you are right: there is no faster way to implement a bit array than a bit array. I haven't studied sets (besides the small amount I've been exposed to via database operations), so if there are no other operations than boolean algebra on sets, then tell me and I will delete my answer. I was under the impression that the domain was bigger than that, though. Also, bit fields are not necessarily the best storage solution for sparse allocation.
Merlyn Morgan-Graham
A: 

If you really are satisfied with 32- and 64-bit types, in modern C (aka C99) the typedefs uint_least32_t and uint_least64_t are guaranteed to exist in "stdint.h". In contrast to the exact-width types uint32_t and uint64_t (which are optional), they may correspond to a base type that is wider than the number indicates.

If speed is important you may also use uint_fast32_t and uint_fast64_t, which must also exist. They trade size for speed, and are supposed to use the corresponding base type that has the "fastest" support on the target machine. The blow-up in data size may be substantial, though. E.g. on my 64-bit Ubuntu, all of these "fast" types are 64-bit types.

If you use gcc, you also have __uint128_t on 64-bit machines as an extra service.

Jens Gustedt