I have a bit array that can be very dense in some parts and very sparse in others. The array can get as large as 2**32 bits. I am turning it into a bunch of tuples containing offset and length to make it more efficient to deal with in memory. However, this representation is sometimes less efficient with input like 10101010100011. Any ideas on a good way of storing this in memory?

+2  A: 

If I understand correctly, you're using tuples of (offset, length) to represent runs of 1 bits? If so, a better approach would be to use runs of packed bitfields. For dense areas, you get a nice efficient array, and in non-dense areas you get implied zeros. For example, in C++, the representation might look like:

// The map key is the offset; the vector's length gives you the length
std::map<unsigned int, std::vector<uint32_t> >

A lookup would consist of finding the key before the bit position in question, and seeing if the bit falls in its vector. If it does, use the value from the vector. Otherwise, return 0. For example:

typedef std::map<unsigned int, std::vector<uint32_t> > bitmap; // for convenience
typedef std::vector<uint32_t> bitfield; // also convenience

bool get_bit(const bitmap &bm, unsigned int idx) {
  unsigned int offset = idx / 32; // word index into the whole array
  bitmap::const_iterator it = bm.upper_bound(offset);

  // it is the element /after/ the one we want
  if (it == bm.begin()) {
    // but it's the first, so we don't have the target element
    return false;
  }

  --it;

  // make offset relative to this element's start
  offset -= it->first;
  // does our bit fall within this element?
  if (offset >= it->second.size())
    return false; // nope

  uint32_t bf = it->second[offset];
  // extract the bit of interest
  return (bf & (1u << (idx % 32))) != 0;
}
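For completeness, here is a hedged sketch of the matching setter under the same layout (the name `set_bit` is mine, not from the answer). It is deliberately naive: if no existing run covers the bit, it creates a new one-word run rather than merging adjacent runs.

```cpp
#include <cstdint>
#include <map>
#include <vector>

typedef std::map<unsigned int, std::vector<uint32_t> > bitmap;

// Naive setter sketch: sets the bit in an existing run if one covers
// its word, otherwise starts a new one-word run at that word. A real
// implementation would also merge runs that become adjacent.
void set_bit(bitmap &bm, unsigned int idx) {
  unsigned int word = idx / 32;
  bitmap::iterator it = bm.upper_bound(word);
  if (it != bm.begin()) {
    --it;
    unsigned int rel = word - it->first;
    if (rel < it->second.size()) {
      it->second[rel] |= 1u << (idx % 32);
      return;
    }
  }
  bm[word] = std::vector<uint32_t>(1, 1u << (idx % 32));
}
```

Together with get_bit above this gives you constant-plus-log-time access while only paying for words that actually contain set bits.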
bdonlan
Sorry, but I am not sure how this works. What does the vector contain?
I've added a code sample - basically, the vector contains a series of uint32_ts, which each give a 32-bit slice of the bit array
bdonlan
A question about your code sample: Is there an off-by-one error here? You return false if (offset > it.second.size()), so if offset == it.second.size(), you don't return, and you then look at it.second[offset]. Should you have written >= instead of > ?
Steve Kass
@steve, good point. fixed.
bdonlan
+1  A: 

It would help to know more. By "very sparse/dense," do you mean millions of consecutive zeroes/ones, or do you mean local (how local?) proportions of 0's very close to 0 or 1? Does one or the other value predominate? Are there any patterns that might make run-length encoding effective? How will you use this data structure? (Random access? What kind of distribution of accessed indexes? Are huge chunks never or very rarely accessed?)

I can only guess you aren't going to be randomly accessing and modifying all 4 billion bits at rates of billions of bits/second. Unless it is phenomenally sparse/dense on a local level (such as any million consecutive bits are likely to be the same except for 5 or 10 bits) or full of large scale repetition or patterns, my hunch is that the choice of data structure depends more on how the array is used than on the nature of the data.

Steve Kass
A: 

How to structure things will depend on the nature of your data. To represent large amounts of data compactly, you need long runs of zeros or ones, which eliminates the need to represent those bits explicitly. If that is not the case and you have approximately equal numbers of ones and zeros throughout, you are better off just storing the whole array in memory.

It might help to think of this as a compression problem. For compression to be effective there has to be a pattern (or a limited set of items used out of the entire space) and an uneven distribution. If all the elements are used and evenly distributed, compression is hard to do, and could take more space than the actual data.

If there are only runs of zeros and ones (longer than just one bit), using offset and length makes some sense. If the runs are inconsistent, you could just copy the bits as a bit array, storing offset, length, and values.

How efficient the above is will depend on whether you have large runs of ones or zeros. You will want to be careful not to use more memory for the representation than the raw bits themselves would take.
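The (offset, length, value) idea above can be sketched as a simple run-length encoder (the `Run` struct and `rle_encode` name are mine, for illustration):

```cpp
#include <cstddef>
#include <vector>

struct Run {
  std::size_t offset; // index of the first bit in the run
  std::size_t length; // number of identical bits
  bool value;         // the bit repeated throughout the run
};

// Run-length encode a bit sequence. This compresses well only when
// runs are long; a pattern like 10101010 produces one Run per bit,
// which is exactly the "more memory than the memory itself" trap the
// answer warns about.
std::vector<Run> rle_encode(const std::vector<bool> &bits) {
  std::vector<Run> runs;
  for (std::size_t i = 0; i < bits.size(); ++i) {
    if (!runs.empty() && runs.back().value == bits[i]) {
      ++runs.back().length; // extend the current run
    } else {
      Run r = {i, 1, bits[i]};
      runs.push_back(r); // start a new run
    }
  }
  return runs;
}
```

Comparing runs.size() * sizeof(Run) against the raw bit count tells you whether the encoding is actually a win for your data.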

Glenn
A: 

Check out the Bison source code and look at its bitset implementation. It provides several flavors of implementation to deal with bit arrays of different densities.

ZZ Coder
A: 

How many of these do you intend to keep in memory at once?

As far as I can see, 2**32 bits = 512 MB, only half a gig, which isn't very much memory nowadays. Do you have anything better to do with it?

Assuming your server has enough RAM, allocate it all at startup and keep it in memory. The network-handling thread can then answer each lookup in just a few instructions in constant time, so it should be able to keep up with any workload.
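A sketch of that flat approach (class and method names are mine; a smaller size is used below for illustration, since the full 2**32 bits would be one 512 MB allocation):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Flat bitmap: a single allocation with O(1) get/set, no per-run
// bookkeeping. For 2**32 bits the vector would hold 2**27 words
// (512 MB), allocated once at startup as the answer suggests.
class FlatBitmap {
 public:
  explicit FlatBitmap(std::size_t nbits)
      : words_((nbits + 31) / 32, 0) {}

  bool get(std::size_t idx) const {
    return (words_[idx / 32] >> (idx % 32)) & 1u;
  }

  void set(std::size_t idx) {
    words_[idx / 32] |= 1u << (idx % 32);
  }

 private:
  std::vector<uint32_t> words_;
};
```

The trade-off versus the run-based structures above is memory for speed: every access is a shift and a mask, regardless of how sparse the data is.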

MarkR