Thanks to those who've answered my previous questions and gotten me this far.
I have a table of about 25,000 vectors, each with 48 dimensions, with values ranging from 0-255.
I am attempting to develop a Locality-Sensitive Hashing (http://en.wikipedia.org/wiki/Locality-sensitive_hashing) scheme for finding near-neighbor or nearest-neighbor points.
My current LSH function is this:
from math import floor
from random import normalvariate, uniform

def lsh(vector, r=1.0, a=None, b=None):
    # Random projection vector: one draw per dimension from a normal distribution.
    if a is None:
        a = [normalvariate(10, 4) for i in range(48)]
    # Random offset in [0, r).
    if b is None:
        b = uniform(0, r)
    # Project onto a, shift by b, and quantize into buckets of width r.
    hashVal = floor((sum(a[i] * vector[i] for i in range(48)) + b) / r)
    return int(hashVal)
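For context, here is a rough sketch of how I'm calling it: I generate one (a, b) pair up front and reuse it for every vector, so that all 25,000 points are hashed with the same function (the random stand-in data is just for illustration in place of my real table).

    from random import normalvariate, uniform, randint

    # Stand-in for my real table: 25,000 vectors of 48 values in the range 0-255.
    vectors = [[randint(0, 255) for _ in range(48)] for _ in range(25000)]

    r = 1.0
    a = [normalvariate(10, 4) for _ in range(48)]  # one projection vector per hash function
    b = uniform(0, r)                              # one offset per hash function

    # Group vector indices by hash value so nearby points land in the same bucket.
    buckets = {}
    for idx, vector in enumerate(vectors):
        buckets.setdefault(lsh(vector, r, a, b), []).append(idx)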
My questions at this point are:
A: Are there optimal values for the "normalvariate(10, 4)" portion of my code? This is Python's built-in random.normalvariate (http://docs.python.org/library/random.html#random.normalvariate) function, which I am using to produce the "d dimensional vector with entries chosen independently from a stable distribution". From my experimenting, the values don't seem to matter too much.
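For what it's worth, my experimenting was roughly along these lines, using the stand-in vectors from above (the particular (mu, sigma) pairs are arbitrary choices, not values I'm recommending):

    from random import normalvariate, uniform

    # Hash the same data with projection vectors drawn from different normal
    # distributions and compare how many distinct buckets each choice produces.
    r = 1.0
    for mu, sigma in [(0, 1), (10, 4), (100, 25)]:  # arbitrary parameter pairs
        a = [normalvariate(mu, sigma) for _ in range(48)]
        b = uniform(0, r)
        keys = set(lsh(v, r, a, b) for v in vectors)
        print(mu, sigma, len(keys))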
B: At the top of the Wikipedia article it states:
if d(p,q) <= R, then h(p) = h(q) with probability at least P1
if d(p,q) >= cR, then h(p) = h(q) with probability at most P2
Is the R value mentioned here the same as the r value mentioned under the Stable Distributions section (http://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)?
C: Related to my previous question (B): I've discovered that using higher values of r in my hashing function maps my vectors into a smaller range of hash values. Is there a way to optimize my r value?
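To show what I mean, this is roughly how I observed the effect (again using the stand-in vectors from above):

    from random import normalvariate, uniform

    # The projection sum is divided by r before flooring, so larger bucket
    # widths collapse the data into fewer distinct hash values.
    a = [normalvariate(10, 4) for _ in range(48)]
    for r in [0.5, 1.0, 4.0, 16.0]:  # arbitrary bucket widths to compare
        b = uniform(0, r)
        keys = [lsh(v, r, a, b) for v in vectors]
        print(r, min(keys), max(keys), len(set(keys)))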
D: Approximately how many tables might one use?
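For clarity, this is my understanding of how multiple tables would be combined at query time; the choice of L below is exactly the number I'm asking about, and the scheme itself is my assumption of standard multi-table LSH rather than something I've verified:

    from random import normalvariate, uniform

    L, r = 10, 1.0  # L = number of tables, the quantity question D asks about
    tables = []
    for _ in range(L):
        # Each table gets its own independent hash function (a, b).
        a = [normalvariate(10, 4) for _ in range(48)]
        b = uniform(0, r)
        buckets = {}
        for idx, v in enumerate(vectors):
            buckets.setdefault(lsh(v, r, a, b), []).append(idx)
        tables.append((a, b, buckets))

    def candidates(query):
        # Union of bucket members across all tables; these are the points whose
        # exact distance to 'query' I would then compute.
        found = set()
        for a, b, buckets in tables:
            found.update(buckets.get(lsh(query, r, a, b), []))
        return found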