views:

363

answers:

6

Say you want to implement a click tracker where you want to only count a click to a link from any IP address once, but the number of links and clients is very large and you don't want to keep a table of every single IP-click. Say that you might need this as part of something that runs live against every click and don't want to do a lookup against a big table for every click.

Is there such a thing as "probabilistic hashing" or "lossy hashing" to see if an IP is probably in a set but you don't care if there is a certain error rate as you want to save resources?

+13  A: 

You could probably (ab?)use a bloom filter for something like this.

Hank Gay
Cool information, didn't know such a thing existed.
Stefan Mai
Yes, this is almost exactly the application Bloom filters were designed for. Note also that the errors are only one-way (only occasional false positives, never false negatives).
ShreevatsaR
This is good to have seen--I hadn't heard of it either, but I think using a bit array might be better in this case (suggested by another answer).
Bill K
A Bloom filter *is* a bit array. If you have as many bits as the search space, then you need to use only one "hash function" (e.g. the identity function), and Bloom filter reduces as a special case to the naive bit array implementation. When you have less space than that, using a Bloom filter is an elegant and simple thing to do.
ShreevatsaR
Yes, but if you have enough room, a one-to-one bit array will work better, there is no reason to deal with hashing or failures or guessing at an array size or the fact that as you add more entries you will have geometrically more false positives.
Bill K
And the Bloom filter will not reduce automatically to a bit array equivalent. Hashes usually have the ability to return duplicate values for different items. If you re-worked your hash function to be 1:1 then--well then you'd have a 1-1 bit map, not a Bloom filter. Read the referenced wikipedia article where they suggest using a bitmap instead if you can fit it in memory.
Bill K
I already agreed with that -- when there is enough space, a simple bit array is the most natural thing to use. (And a hash function that maps N values to N or more bins and *still* has collisions would be a poor hash function indeed! That's why I mentioned the identity function as an example: with one such hash function, a Bloom filter is by definition the same as a 1-1 bit map.)
ShreevatsaR
+2  A: 

Sure! Decide how many "bins" you can afford (at one bit each), say N; hash the IP address to a string of bits B; take B modulo N. You can compute the probability of accidental collisions (with some approximation, such as, assume all hashed IP addresses form equally likely bitstrings B) and determine N accordingly if you have a constraint on maximum accidental collision probability that's acceptable for your application.

Alex Martelli
You can even use the collision probability to calculate how many collisions you probably missed for a given "fill-level" of the hash table, and with this increase the accuracy of the guessed total count.
sth
That is also called a bloom filter
tomjen
A: 

Integer (ipv4) or string (ipv6) hashing without collision handling, using the hash-value (modulo hash table size) as the index to a bitmap-array.

bill
+3  A: 

All hashing is lossy, by virtue of the pigeonhole principle. Inevitably, you are trying to cram N things into M slots (where N >> M). All you need to do is simply not handle collision cases and pick a sufficiently large hash table.

John Feminella
+4  A: 

Assuming IPv4 addresses, there is a search space of 232. You need no more than 1 bit per IP address (0 == no visit, 1 == visit). Without considering storage overhead, this would take 512 MB (229) to store. So a simplistic implementation would allocate a 512 MB array (or a table with 229 rows, each storing a byte, or 227 rows, each storing a 32-bit integer, or 226 rows, each storing a 64-bit integer, or...)

You could optimize this for sparse population by turning it into a tree.

Define a "page" size of 2x bits. You will allocate storage for one page at a time.

Divide your search space (232) by your page size. This is the total number of pages required to represent every possible address in your search space.

Then, to determine if there is a hit in your hash, you will first determine if the page is present, and if so, whether the appropriate bit in the page is set. To cache an address, you will first determine if the page is present; if not you will create it. Next you will set the appropriate bit.

This molds fairly easily to a database table. You would need just two columns: a page index and a binary array. When you allocate a page, you will simply store a row in the table with the correct page index and an empty binary array.

For instance, for a 1024-bit page size (yielding 222 maximum pages), you might structure your table like this:

CREATE TABLE VisitedIPs(
    PageIndex int         NOT NULL PRIMARY KEY,
    PageData  binary(128) NOT NULL
)

To check whether an IP has visited, you would use code similar to (pseudocode):

uint ip = address.To32Bit();

string sql =
    "SELECT PageData " +
    "FROM VisitedIPs " +
    "WHERE PageIndex = " + (ip >> 10);

byte[] page = (byte[])GetFromDB(sql);

byte b = page[(ip & 0x3FF) >> 3];

bool hasVisited = (b & (1 << (ip & 7)) != 0;

Setting that an IP has visited is similar:

uint ip = address.To32Bit();

string sql = "SELECT PageData " + "FROM VisitedIPs " + "WHERE PageIndex = " + (ip >> 10);

byte[] page = (byte[])GetFromDB(sql);

page[(ip & 0x3FF) >> 3] |= (1 << (ip & 7));

sql =
    "UPDATE VisitedIPs " +
    "SET PageData = @pageData " +
    "WHERE PageIndex = " + (ip >> 10);

ExecSQL(sql, new SqlParam("@pageData", page));
P Daddy
+1  A: 

Start truncating bits.

The probability of a hash collision becomes 50% when you have 2^(n/2) things out of a possible 2^n. An IP address is 2^32 so there's a 50% chance of a collision when 2^16 items are in the container.

Decrease as you feel comfortable.

+1: Since the addresses have known structures (A, B and C) you can leverage this to decide where and how to skip bits.
S.Lott