views:

448

answers:

4

I [surely re] invented this [wheel] when I wanted to compute the union and the intersection and diff of two sets (stored as lists) at the same time. Initial code (not the tightest):

dct = {}
for a in lst1:
  dct[a] = 1
for b in lst2:
  if b in dct:
    dct[b] -= 1
  else:
    dct[b] = -1

union = [k for k in dct]
inter = [k for k in dct if dct[k] == 0]
oneminustwo = [k for k in dct if dct[k] ==  1]
twominusone = [k for k in dct if dct[k] ==  -1]

Then I realized that I should use 00, 01, 10, and 11 instead of -1, 1, 0, ... So, a bit at position n indicates membership in set n.

This can be generalized to up to 32 sets using an 32-bit int, or to any number of sets using a bitarray, or a string. So, you pre-compute this dictionary once, and then use very fast O(n) queries to extract elements of interest. For instance, all 1s means intersection of all sets. All 0s is a special one - will not occur.

Anyhow, this is not to toot my own horn. This surely was invented before and has a name. What is it called? Is this approach used in databases somewhere?

Thanks.

+1  A: 

Is the idea of a bit field what you're looking for? Every bit of your ... field (for lack of a better word) represents a flag. In this case, your flag is membership in set N.

Edit - I think I misunderstood which idea you were referring to. Disregard?

Sapph
Ok, bit field is the value, but what about using the bit field for queries?
Hamish Grubijan
It's not **exactly** what you're talking about, but you might be interested in reading up on "bloom filters". They use a bit field to check for set membership (but can turn up false positives).
Sapph
+1  A: 

It's very common to use an integer to represent a set of small integers; it's often called a bitset or bitvector. Here you're using an integer to represent "the set of input sequences that contain this value".

The operation you're doing reminds me of reversing a multimap.

In your case, the input is a list of lists:

[["a", "b"], ["a", "c", "d"]]

But you could instead think of it as a bag of ordered pairs, like this:

0, "a"
0, "b"
1, "a"
1, "c"
1, "d"

You're simply constructing a table that contains the reversed pairs

"a", 0
"b", 0
"a", 1
"c", 1
"d", 1

which looks like this:

{"a": [0, 1],
 "b": [0],
 "c": [1],
 "d": [1]}

and you happen to be representing those arrays of integers using bitvectors.

The original data structure (the list of lists) made it easy to iterate over all values for a given list. The reversed data structure (the dictionary of lists) makes it easy to find all lists that have a given value.

Jason Orendorff
+7  A: 

Using an N bit integer to represent N booleans is a special case of the data structure known as a perfect hash table. Notice you're explicitly using dicts (which are general hash tables) in the idea that prompted you to think about bitsets. It's a hash table because you use hashes to find a value, and it's perfect because you never have collisions. The special case is because of how the table is packed and stored.

Formulate the hash function, which shows how it's different from an array:

int bitset_hash(int n) {
  // domain of this function is only non-negative ints
  return 1 << n;
}

Notice bitset_hash(3) is 0b1000, which corresponds to the 4th item (offset/index 3) when using a C int and bitwise operations. (Because of the storage implementation detail, bitwise operations are also used to manipulate a specific item from the hash.)

Extending the approach to use bitwise-and/-or/-xor for set operations is common, and doesn't require any special name other than "set operations" or, if you need a buzzword, "set theory".

Finally, here's another example use of it in a prime sieve (I've used this code on Project Euler solutions):

class Sieve(object):
  def __init__(self, stop):
    self.stop = stop
    self.data = [0] * (stop // 32 // 2 + 1)
    self.len = 1 if stop >= 2 else 0
    for n in xrange(3, stop, 2):
      if self[n]:
        self.len += 1
        for n2 in xrange(n * 3, stop, n * 2):
          self[n2] = False

  def __getitem__(self, idx):
    assert idx >= 2
    if idx % 2 == 0:
      return idx == 2
    int_n, bit_n = divmod(idx // 2, 32)
    return not bool(self.data[int_n] & (1 << bit_n))

  def __setitem__(self, idx, value):
    assert idx >= 2 and idx % 2 != 0
    assert value is False
    int_n, bit_n = divmod(idx // 2, 32)
    self.data[int_n] |= (1 << bit_n)

  def __len__(self):
    return self.len

  def __iter__(self):
    yield 2
    for n in xrange(3, self.stop, 2):
      if self[n]:
        yield n
Roger Pate
-1 for calling an array a special case of a hash table :)
Jason Orendorff
Jason Orendorff
I did call it a special case due to exactly how it's stored. In a non-perfect hash table, which is most common, I wouldn't call the collision resolution a "flat array", even though an array is commonly used, as there's much more going on. To me that's an implementation detail anyway, and it's the fast lookup of values from hash(key) that's the important part of the definition. -- Though I just realized that if you "take my definitions", you get them all and there's thus no problem. :)
Roger Pate
I didn't say the collision resolution was a flat array. That makes no sense.
Jason Orendorff
*"it's the fast lookup of values from hash(key) that's the important part of the definition"* So your definition of a hash table is "any data structure with fast lookup based on hash(key)". I think this is too loose a definition because, as you've demonstrated, you can point to any intermediate result of the lookup algorithm and call it a hash code. Under your definition, a simple array is a hash table with hash(idx) = (addr + idx * size).
Jason Orendorff
More precisely, you were completely ignoring collisions. The popular array-of-linked-lists implementation, for example, *is not a flat array*. And no, if you want to stretch things that way, a simple array would be a hash table with an identity hash function---which is only a hash function in the most basic sense. Please don't put words in my mouth, as I never said emphasized *any data structure* like that, just that fast lookup from keys is *more* important than implementation issues.
Roger Pate
An array is a special case of a dictionary (or, alternatively, a dictionary is a generalization of array). Dictionaries are often implemented using hash tables. Therefore, it's not a huge stretch to call an array a special case of hash table.
Wayne Conrad
Roger: I didn't mean to put words in your mouth, just understand what you were saying. Cheers!
Jason Orendorff
+4  A: 

Yes, it is sometimes used in databases, for example PostgreSQL. As mentions Wikipedia:

Some database systems that do not offer persistent bitmap indexes use bitmaps internally to speed up query processing. For example, PostgreSQL versions 8.1 and later implement a "bitmap index scan" optimization to speed up arbitrarily complex logical operations between available indexes on a single table.

(from http://en.wikipedia.org/wiki/Bitmap_index)

Antoine P.