ansaurus

Question

Probabilistic hashing -- is there such a thing?

Answer 1

+13 A:

You could probably (ab?)use a bloom filter for something like this.

Hank Gay 2009-06-17 16:08:52

Cool information, didn't know such a thing existed.

Stefan Mai 2009-06-17 16:09:49

Yes, this is almost exactly the application Bloom filters were designed for. Note also that the errors are only one-way (only occasional false positives, never false negatives).

ShreevatsaR 2009-06-17 16:19:59

This is good to have seen--I hadn't heard of it either, but I think using a bit array might be better in this case (suggested by another answer).

Bill K 2009-06-17 18:56:10

A Bloom filter *is* a bit array. If you have as many bits as the search space, then you need to use only one "hash function" (e.g. the identity function), and Bloom filter reduces as a special case to the naive bit array implementation. When you have less space than that, using a Bloom filter is an elegant and simple thing to do.

ShreevatsaR 2009-06-17 19:32:08

Yes, but if you have enough room, a one-to-one bit array will work better, there is no reason to deal with hashing or failures or guessing at an array size or the fact that as you add more entries you will have geometrically more false positives.

Bill K 2009-06-18 01:50:40

And the Bloom filter will not reduce automatically to a bit array equivalent. Hashes usually have the ability to return duplicate values for different items. If you re-worked your hash function to be 1:1 then--well then you'd have a 1-1 bit map, not a Bloom filter. Read the referenced wikipedia article where they suggest using a bitmap instead if you can fit it in memory.

Bill K 2009-06-18 01:52:36

I already agreed with that -- when there is enough space, a simple bit array is the most natural thing to use. (And a hash function that maps N values to N or more bins and *still* has collisions would be a poor hash function indeed! That's why I mentioned the identity function as an example: with one such hash function, a Bloom filter is by definition the same as a 1-1 bit map.)

ShreevatsaR 2009-06-18 17:05:03

Answer 2

+2 A:

Sure! Decide how many "bins" you can afford (at one bit each), say N; hash the IP address to a string of bits B; take B modulo N. You can compute the probability of accidental collisions (with some approximation, such as, assume all hashed IP addresses form equally likely bitstrings B) and determine N accordingly if you have a constraint on maximum accidental collision probability that's acceptable for your application.

Alex Martelli 2009-06-17 16:10:26

You can even use the collision probability to calculate how many collisions you probably missed for a given "fill-level" of the hash table, and with this increase the accuracy of the guessed total count.

sth 2009-06-17 16:37:16

That is also called a bloom filter

tomjen 2009-06-18 10:00:43

Answer 3

A:

Integer (ipv4) or string (ipv6) hashing without collision handling, using the hash-value (modulo hash table size) as the index to a bitmap-array.

bill 2009-06-17 16:13:14

Answer 4

+3 A:

All hashing is lossy, by virtue of the pigeonhole principle. Inevitably, you are trying to cram N things into M slots (where N >> M). All you need to do is simply not handle collision cases and pick a sufficiently large hash table.

John Feminella 2009-06-17 16:23:46

Answer 5

+4 A:

Assuming IPv4 addresses, there is a search space of 2³². You need no more than 1 bit per IP address (0 == no visit, 1 == visit). Without considering storage overhead, this would take 512 MB (2²⁹) to store. So a simplistic implementation would allocate a 512 MB array (or a table with 2²⁹ rows, each storing a byte, or 2²⁷ rows, each storing a 32-bit integer, or 2²⁶ rows, each storing a 64-bit integer, or...)

You could optimize this for sparse population by turning it into a tree.

Define a "page" size of 2^x bits. You will allocate storage for one page at a time.

Divide your search space (2³²) by your page size. This is the total number of pages required to represent every possible address in your search space.

Then, to determine if there is a hit in your hash, you will first determine if the page is present, and if so, whether the appropriate bit in the page is set. To cache an address, you will first determine if the page is present; if not you will create it. Next you will set the appropriate bit.

This molds fairly easily to a database table. You would need just two columns: a page index and a binary array. When you allocate a page, you will simply store a row in the table with the correct page index and an empty binary array.

For instance, for a 1024-bit page size (yielding 2²² maximum pages), you might structure your table like this:

CREATE TABLE VisitedIPs(
    PageIndex int         NOT NULL PRIMARY KEY,
    PageData  binary(128) NOT NULL
)

To check whether an IP has visited, you would use code similar to (pseudocode):

uint ip = address.To32Bit();

string sql =
    "SELECT PageData " +
    "FROM VisitedIPs " +
    "WHERE PageIndex = " + (ip >> 10);

byte[] page = (byte[])GetFromDB(sql);

byte b = page[(ip & 0x3FF) >> 3];

bool hasVisited = (b & (1 << (ip & 7)) != 0;

Setting that an IP has visited is similar:

uint ip = address.To32Bit();

string sql = "SELECT PageData " + "FROM VisitedIPs " + "WHERE PageIndex = " + (ip >> 10);

byte[] page = (byte[])GetFromDB(sql);

page[(ip & 0x3FF) >> 3] |= (1 << (ip & 7));

sql =
    "UPDATE VisitedIPs " +
    "SET PageData = @pageData " +
    "WHERE PageIndex = " + (ip >> 10);

ExecSQL(sql, new SqlParam("@pageData", page));

P Daddy 2009-06-17 16:55:05

Answer 6

+1 A:

Start truncating bits.

The probability of a hash collision becomes 50% when you have 2^(n/2) things out of a possible 2^n. An IP address is 2^32 so there's a 50% chance of a collision when 2^16 items are in the container.

Decrease as you feel comfortable.

2009-06-17 17:10:10

+1: Since the addresses have known structures (A, B and C) you can leverage this to decide where and how to skip bits.

S.Lott 2009-06-17 17:17:39

ansaurus

tags:

views:

answers:

Probabilistic hashing -- is there such a thing?

related questions