views:

238

answers:

5

Are there any well-known libraries in Java for sparse bit vectors?

(And are there guidelines for how sparse is useful to use them vs. java.util.BitSet?)

A: 

CERN COLT is widely used for vector and matrix computation, and has sparse matrices, but isn't specifically used for bit vectors.

http://acs.lbl.gov/software/colt/api/cern/colt/matrix/impl/SparseObjectMatrix1D.html

gilesc
+3  A: 

The colt library has sparse matrices (1D, 2D and 3D). It also has an efficient BitVector, with 1 bit per value, rather than 8-bits as boolean[] does.

However, the sparse matrices do not support bits directly - only doubles and objects. You could wrap the 1D sparse double matrix by maping bit index to long indices (bitIndex>>6) since each long holds 64 bits, convert the retrieved double to a raw long value, and use bit manipulation to access the bits of the retrieved long. A little work, but nowhere near as much as implementing the sparse vector yourself. Once your wrapper is working, you might avoid converting doubles to longs, and implement a real sparse long 1d matrix using the available Colt source code for the double 1D sparse matrix as a starting point.

EDIT: More info. The Colt vectors/matrices require no memory initially for storage, assuming all bits (longs) are initially 0. Setting a value to non-zero consumes memory. Setting the value back to 0 continues to consume memory, although memory for zero values is reclaimed periodically.

If the bits are truly sparse, such that each backing long value only has one bit set, then the storage overhead will be very poor, requiring 64-bits per actual bit stored. But as you mention typical case is 20-40% sparse, then the overhead will be much lower, with possibly no wasted storage if bits are clustered in ranges, e.g. bits from 0-100, then 1000-1100, and 2000-2200 (values in hex.) Overall, only 1/16 of the region is assigned to bits, but the clustering means that the bits are stored with no wasted space.

mdma
`java.util.BitSet` stores 64 bit values per element in it's long array. It does not use 8-bits (boolean) to actually store those values.
Kevin Brock
@Kevin - thanks for that - I was thinking of a `boolean[]`. java.util.BitSet is based on `long[]`. Post updated.
mdma
+3  A: 

If its really sparse (e.g., less than 1% loading), then using a hash table indexed by bit index is probably pretty good; mere presence or absence of the index in the table is all you need to know if the bit is one or zero respectively.

If the density is upwards of a few percent, you can use a hash table indexed by bit index divided by 64, and store long words in the hash table containing actual bits. Bit N is set if the hash table contains value V for int(N/64) and (V>>(N mod 64))&1 is true.

Both of these answers assume you want to optimize random access to bits. If you want to optimize sequential (or other access) to bits by index, then you may want a sparse matrix structure, using the same kind of low-level bit vector representation depending on expected density. See Sparse Matrices

Ira Baxter
+1  A: 

You could try FastUtil's AVL Tree Map.

Michael Barker
Yes, but a lot of FastUtil's datastructures aren't fast. And some of them have horrible garbage problems! ;-)
dty
A: 

A hash table where the mere presence or absence of the key tells you something? That would be a hash set then! I'm skeptical of the performance of a set (even a hashed one) over the BitSet. It really depends on whether speed or memory is the primary driver.

dty
Errr... why the down vote?
dty