views:

152

answers:

1

Hi,

I am looking for a hash functions family generator that could generate a family of hash functions given a set of parameters. I haven't found any such generator so far. Is there a way to do that with the hashlib package ?

For example I'd like to do something like : h1 = hash_function(1) h2 = hash_function(2) ...

and h1 and h2 would be different hash functions.

For those of you who might know about it, I am trying to implement a min-hashing algorithm on a very large dataset.

Basically, I have a very large set of features (100 millions to 1 billion) for a given document, and I need to create 1000 to 10000 different random permutations for this set of features.

I do NOT want to build the random permutations explicitly so the technique I would like to use in the following : generate a hash function h and consider that for two indices r and s, r appear before s in the permutation if h(r) < h(s) and do that for 100 to 1000 different hash functions.

Are there any known libraries that I might have missed ? Or any standard way of generating families of hash functions with python that you might be aware of ?

Thanks for your help,

Best,

Nicolas.

+2  A: 

I'd just do something like (if you don't need thread-safety -- not hard to alter if you DO need thread safety -- and assuming a 32-bit Python version):

import random

_memomask = {}

def hash_function(n):
  mask = _memomask.get(n)
  if mask is None:
    random.seed(n)
    mask = _memomask[n] = random.getrandbits(32)
  def myhash(x):
    return hash(x) ^ mask
  return myhash
Alex Martelli
Thanks for this answer. It seems to work great. Any particular for using those type of hash functions ? efficiency ? will yield very different approximate permutations in some sense ?
Nicolas M.
The built-in `hash` is decent and pretty efficient -- xor'ing it with a number depending (but in a sufficiently chaotic way) from the index within the family just seems another decent/efficient way to turn that one hash function into a family. If speed's not a problem you could use stronger (crypto-quality) hashing, I guess -- that will presumably give you higher quality (neither hash nor random are crypto-quality and thus neither's their XOR;-) but the speed impact is REALLY large (orders of magnitude...).
Alex Martelli
Thanks. Actually, I believe that speed will be key for me here. The only "quality" I am looking for is that the hash functions will generate "as different" random permutations as possible ( I am not sure how to quantify this though...) by the process I described in my original question. Again, thanks a lot for your great answer.
Nicolas M.