views:

499

answers:

4

I'd like to know about specific problems you - the SO reader - have solved using bloom filters and what libraries/frameworks you used if you didn't roll your own.

Questions:

  • What problems have you used bloom filters to solve?
  • What libraries/frameworks did you use?

I'm looking for first-hand experiences, so please do not answer unless you have that.

+6  A: 

I used PyBloom to build a bloom filter that represents the set of all Wikipedia article titles. One nice thing about that implementation is that it takes parameters for the error rate and the size of the set, so if you know in advance how large your set will be you can set the error rate very low.

I was impressed that I could build the set of ~7M entries in about 5 minutes, and I can load it from disk in 0.1s.

danben
Great answer! Please keep the tags unchanged. "bloom-filter" only is not sufficient.
knorv
Looking over them again, I am ok with language-agnostic and data-structures. "computer science" is much too broad and "algorithm" is not really applicable. Tags are for categorizing questions to assist future viewers - if we shoved every question into every tag it would devalue them.
danben
Why did you want to construct a set of article titles, without needing to store the actual titles, and allowing false positives?
Qwertie
@Qwertie - I am doing entity extraction over unstructured text. I use a probabilistic algorithm, part of which matches n-grams against Wikipedia article titles. Storing the actual titles is unnecessary for obvious reasons, and false positives are ok because the other parts of the algorithm will cause those to fall to the bottom of the ranking.
danben
+6  A: 

I used an own implementation of bloom filter. This is pretty easy, but you need good hash functions.

I usually use them as an optimization to avoid disk reads. If the program wants to read a value from disk or search for a value on disk, I use a bloom filter of all values to avoid the disk read if the bloom filter returns a negative answer. The program would also be correct without the bloom filter, but when there is a signification chance that the value is not stored on disk -- and therefore the disk read would be useless -- the bloom filter is a nice optimization.

In another scenario I used a bloom filter to hold a "set" of all values that have been touched in some time period. In that specific situation, I did't care about the false positives, it was more important that each value actually touched is "covered". One nice aspect of bloom filters here are that the have a constant size in contrast to a read set. Ok, the error rate might go up, but in that situation I could live with that.

dmeister
+1  A: 

http://blog.rapleaf.com/dev/2007/09/05/bloomfilter/

While I don't have anything directly to do with this,I strongly feel the information on that link is definitely relevant to the discussion over here

A: 

I used them in my thesis actually:

http://sbirch.net/thesis/public.pdf (see Section 3/page 10).

sbirch