ansaurus

Question

Modern, high performance bloom filter in Python?

Answer 1

+3 A:

Look at the array module.

class Bit( object ):
    def __init__( self, size ):
        self.bits= array.array('B',[0 for i in range((size+7)//8)] )
    def set( self, bit ):
        b= self.bits[bit//8]
        self.bits[bit//8] = b | 1 << (bit % 8)
    def get( self, bit ):
        b= self.bits[bit//8]
        return (b >> (bit % 8)) & 1

FWIW, all of the //8 and % 8 operations can be replaced with >>3 and &0x07. This may lead to slightly better speed at the risk of some obscurity.

Also, changing 'B' and 8 to 'L' and 32 should be faster on most hardware. [Changing to 'H' and 16 might be faster on some hardware, but it's doubtful.]

S.Lott 2008-11-22 13:51:25

That's lovely! +1

Ali A 2008-11-22 16:52:33

Answer 2

+10 A:

I recently went down this path as well; though it sounds like my application was slightly different. I was interested in approximating set operations on a large number of strings.

You do make the key observation that a fast bit vector is required. Depending on what you want to put in your bloom filter, you may also need to give some thought to the speed of the hashing algorithm(s) used. You might find this library useful. You may also want to tinker with the random number technique used below that only hashes your key a single time.

In terms of non-Java bit array implementations:

Boost has dynamic_bitset
Java has the built in BitSet

I built my bloom filter using BitVector. I spent some time profiling and optimizing the library and contributing back my patches to Avi. Go to that BitVector link and scroll down to acknowledgments in v1.5 to see details. In the end, I realized that performance was not a goal of this project and decided against using it.

Here's some code I had lying around. I may put this up on google code at python-bloom. Suggestions welcome.

from BitVector import BitVector
from random import Random
# get hashes from http://www.partow.net/programming/hashfunctions/index.html
from hashes import RSHash, JSHash, PJWHash, ELFHash, DJBHash


#
# [email protected] / www.asciiarmor.com
#
# copyright (c) 2008, ryan cox
# all rights reserved 
# BSD license: http://www.opensource.org/licenses/bsd-license.php
#

class BloomFilter(object):
def __init__(self, n=None, m=None, k=None, p=None, bits=None ):
 self.m = m
 if k > 4 or k < 1:
  raise Exception('Must specify value of k between 1 and 4')
 self.k = k
 if bits:
  self.bits = bits
 else:
  self.bits = BitVector( size=m )
 self.rand = Random()
 self.hashes = []
 self.hashes.append(RSHash)
 self.hashes.append(JSHash)
 self.hashes.append(PJWHash)
 self.hashes.append(DJBHash)

 # switch between hashing techniques
 self._indexes = self._rand_indexes
 #self._indexes = self._hash_indexes

def __contains__(self, key):
 for i in self._indexes(key): 
  if not self.bits[i]:
   return False 
 return True 

def add(self, key):
 dupe = True 
 bits = []
 for i in self._indexes(key): 
  if dupe and not self.bits[i]:
   dupe = False
  self.bits[i] = 1
  bits.append(i)
 return dupe

def __and__(self, filter):
 if (self.k != filter.k) or (self.m != filter.m): 
  raise Exception('Must use bloom filters created with equal k / m paramters for bitwise AND')
 return BloomFilter(m=self.m,k=self.k,bits=(self.bits & filter.bits))

def __or__(self, filter):
 if (self.k != filter.k) or (self.m != filter.m): 
  raise Exception('Must use bloom filters created with equal k / m paramters for bitwise OR')
 return BloomFilter(m=self.m,k=self.k,bits=(self.bits | filter.bits))

def _hash_indexes(self,key):
 ret = []
 for i in range(self.k):
  ret.append(self.hashes[i](key) % self.m)
 return ret

def _rand_indexes(self,key):
 self.rand.seed(hash(key))
 ret = []
 for i in range(self.k):
  ret.append(self.rand.randint(0,self.m-1))
 return ret

if __name__ == '__main__':
    e = BloomFilter(m=100, k=4)
    e.add('one')
    e.add('two')
    e.add('three')
    e.add('four')
    e.add('five')        

    f = BloomFilter(m=100, k=4)
    f.add('three')
    f.add('four')
    f.add('five')
    f.add('six')
    f.add('seven')
    f.add('eight')
    f.add('nine')
    f.add("ten")        

    # test check for dupe on add
    assert not f.add('eleven') 
    assert f.add('eleven') 

    # test membership operations
    assert 'ten' in f 
    assert 'one' in e 
    assert 'ten' not in e 
    assert 'one' not in f         

    # test set based operations
    union = f | e
    intersection = f & e

    assert 'ten' in union
    assert 'one' in union 
    assert 'three' in intersection
    assert 'ten' not in intersection
    assert 'one' not in intersection

Also, in my case I found it useful to have a faster count_bits function for BitVector. Drop this code into BitVector 1.5 and it should give you a more performant bit counting method:

def fast_count_bits( self, v ):
    bits = (
            0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
            1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
            1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
            2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
            1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
            2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
            2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
            3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
            1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
            2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
            2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
            3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
            2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
            3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
            3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
            4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8 )

    return bits[v & 0xff] + bits[(v >> 8) & 0xff] + bits[(v >> 16) & 0xff] + bits[v >> 24]

Ryan Cox 2008-11-22 23:35:59

Thanks Ryan, very informative. Regarding performance of BitVector, did you find a faster alternative? Also, I noticed you're only using 4 hashes, which seems a bit low. Any thoughts on that? A common practice seems to be using something like SHA1 and split up the bits to form multiple hashes.

Parand 2008-11-23 02:06:05

Hashcount depends on: # elements and acceptable false positive rate. I have an improved version of the above that I will checkin. Haven't found anything faster ( though I imagine that it would be a native implementation ).

Ryan Cox 2008-11-27 19:04:26

Answer 3

+3 A:

Eventually I found pybloomfiltermap. I haven't used it, but it looks like it'd fit the bill.

Parand 2010-08-02 17:00:09

let me know if the library is useful for you!

Mike Axiak 2010-09-15 23:16:25

Answer 4

A:

Hi,

I am keenly interested in Bloom filters variants, their performance and understand their use-cases. There are so many well-cited research work on Bloom filter variants( including ones published in top-notch conferences like SIGCOMM,SIGMETRICS) yet I dont think their presence is strong in mainstream language libraries. Why do you think that's the case?

While my interest is language-agnostic,I wanted to share an article I wrote on Bloom filter variants, and applications of Bloom filter.

http://appolo85.wordpress.com/2010/08/03/bloom-filter/

I would love to learn more about their use-cases of the Bloom filter variants, and their design/implementation, and libraries in other languages.

Do you think that most of the publications, and ( code?) on Bloom filters variants , only serve to increment the published paper count of a PhD graduate?
Or is it that most people do not want to mess with a production-ready standard bloom filter implementation that "works just fine" :D

2010-08-03 19:45:53

ansaurus

tags:

views:

answers:

Modern, high performance bloom filter in Python?

related questions