Here is the main problem: I have a large database (25,000 or so entries) of 48-dimensional vectors, each populated with values ranging from 0-255. The specifics aren't so important, but I figure they might help give context.
I don't need the exact nearest neighbor, so approximate neighbor searches that are within a reasonable degree of accuracy are acceptable. I've been toying around with Locality-Sensitive Hashing, but I'm very, very lost.
I've done my best to write a hash function as described in the article under "Stable Distributions". Here is the code.
from random import normalvariate, uniform

def lsh(vector, mean, stdev, r=1.0, a=None, b=None):
    # a: random projection vector with normally distributed entries
    # b: random offset in [0, r)
    if a is None:
        a = [normalvariate(mean, stdev) for i in range(48)]
    if b is None:
        b = uniform(0, r)
    # dot product of the data vector with a, shifted by b and scaled by 1/r
    hashVal = (sum(a[i] * vector[i] for i in range(48)) + b) / r
    return hashVal
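For reference, here's roughly how I'm calling it: I draw one a and one b up front and pass them in for every vector so the hash values are comparable across points (database is just a stand-in name for my list of 25,000 vectors):

from random import normalvariate, uniform

r = 1.0
sharedA = [normalvariate(0, 1) for i in range(48)]  # one projection vector for the whole run
sharedB = uniform(0, r)                             # one offset for the whole run

# hash every vector in the database with the same a and b
hashes = [lsh(v, 0, 1, r, sharedA, sharedB) for v in database]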
The hashing function is "working", at least somewhat. If I order a list of points by hash value and compute the average distance between a point and its neighbor in the list, the average distance is about 400, compared to an average distance of about 530 for any two randomly selected points.
My biggest questions are these:
A: Any suggestions on where I can read more about this? My searching hasn't produced a lot of results.
B: The method suggests it should output an integer value (which mine does not), and then you're supposed to look for matches on that integer value, where a match denotes a likely nearest neighbor. I understand I'm supposed to compute a set of tables of hash values for all my points and then check those tables for hash matches, but the values I'm returning are real-valued rather than discretized, so it doesn't seem like I'll end up with any matches at all. More testing is needed on my part. (I've put a rough sketch of what I think this step should look like after question C.)
C: Are there instructions anywhere on how to construct hash functions based on the other hashing methods?
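For question B, here's a rough sketch of what I think the table-building step is supposed to look like, based on my reading: apply floor() so each point lands in an integer bucket, build several independent tables (each with its own random a and b), and treat any point that shares a bucket with the query in at least one table as a candidate neighbor. The NUM_TABLES and R values, the defaultdict layout, and the database / some_vector names are just my guesses and placeholders, not anything from the article:

from collections import defaultdict
from math import floor
from random import normalvariate, uniform

DIM = 48
NUM_TABLES = 10   # guess: more tables -> more candidate matches
R = 50.0          # guess at a bucket width; probably needs tuning for 0-255 data

def make_table(points):
    # Hash every point into an integer bucket using one random a and b.
    a = [normalvariate(0, 1) for _ in range(DIM)]
    b = uniform(0, R)
    table = defaultdict(list)
    for idx, p in enumerate(points):
        key = floor((sum(a[i] * p[i] for i in range(DIM)) + b) / R)
        table[key].append(idx)
    return a, b, table

def candidates(query, tables):
    # Collect indices of points that share a bucket with the query in any table.
    found = set()
    for a, b, table in tables:
        key = floor((sum(a[i] * query[i] for i in range(DIM)) + b) / R)
        found.update(table[key])
    return found

# tables = [make_table(database) for _ in range(NUM_TABLES)]
# near = candidates(some_vector, tables)  # then check real distances on just these

(I gather real implementations concatenate several of these hash values per table to make the buckets more selective, but one hash per table keeps the idea clear.)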