tags:
views: 106
answers: 2

Here is the main problem: I have a very large database (25,000 entries or so) of 48-dimensional vectors, each populated with values ranging from 0-255. The specifics are not so important, but I figure they might help give context.

I don't need an exact nearest neighbor, so approximate nearest-neighbor searches that are within a degree of accuracy are acceptable. I've been toying around with Locality-Sensitive Hashing, but I'm very, very lost.

I've written a hash function, as described in the article under "Stable Distributions", as best I can. Here is the code:

from random import normalvariate, uniform

def lsh(vector, mean, stdev, r=1.0, a=None, b=None):
    # Draw the random projection vector and offset if not supplied.
    if a is None:
        a = [normalvariate(mean, stdev) for i in range(48)]
    if b is None:
        b = uniform(0, r)
    hashVal = (sum(a[i] * vector[i] for i in range(48)) + b) / r
    return hashVal

The hashing function is 'working', at least somewhat. If I order a list of points by hash value and compute the average distance between a point and its neighbor in the list, the average distance is about 400, compared to an average distance of about 530 for any two randomly selected points.
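The evaluation described above can be sketched as follows, on synthetic uniform data (so the resulting averages will differ from the 400/530 figures); the dataset size, seed, and function names here are illustrative:

```python
import math
import random

def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

random.seed(0)
points = [[random.randint(0, 255) for _ in range(48)] for _ in range(500)]

# One stable-distribution projection, shared by all points.
a = [random.normalvariate(0, 1) for _ in range(48)]
b = random.uniform(0, 1.0)
hashed = sorted(points, key=lambda v: sum(ai * vi for ai, vi in zip(a, v)) + b)

# Average distance between neighbours in the sorted list vs. random pairs.
adjacent = [euclidean(hashed[i], hashed[i + 1]) for i in range(len(hashed) - 1)]
rand_pairs = [euclidean(random.choice(points), random.choice(points))
              for _ in range(len(points))]

print(sum(adjacent) / len(adjacent), sum(rand_pairs) / len(rand_pairs))
```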

My biggest questions are these.

A: Any suggestions on where I can read more about this? My searching hasn't produced many results.

B: The method suggests it should output an integer value (mine does not). You're then supposed to find matches for this integer value, and a match denotes a likely nearest neighbor. I understand I'm supposed to compute a set of tables of hash values for all my points and then check those tables for hash matches, but the values I'm returning don't seem fine-grained enough to produce matches at all. More testing is needed on my part.

C: Any instructions on how to construct hash functions based on the other hashing methods?

+1  A: 

Maybe this is a little off-topic, but you can try using PCA (http://en.wikipedia.org/wiki/Principal_component_analysis) to reduce the dimensionality of the dataset. There should be plenty of PCA modules designed for NumPy (for example: http://folk.uio.no/henninri/pca_module/). The method is rather simple, and with ready-to-use modules it will be a snap.

Basically, it reduces the number of dimensions (you should be able to specify the desired number) by maximizing the variance retained within the given number of dimensions.
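The idea can also be sketched directly with NumPy, without a dedicated PCA module: centre the data, take the SVD, and project onto the leading components. The array shapes and the target dimensionality (k = 10) here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.integers(0, 256, size=(25000, 48)).astype(float)

k = 10                               # desired number of dimensions
centred = data - data.mean(axis=0)   # PCA requires centred data
U, s, Vt = np.linalg.svd(centred, full_matrices=False)
reduced = centred @ Vt[:k].T         # project onto the top-k components

print(reduced.shape)  # (25000, 10)
```

The rows of `Vt` are the principal directions, ordered by decreasing variance, so keeping the first k rows keeps the most informative k dimensions.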

Piotr Duda
I ended up using the MTP toolkit for Python to perform PCA on my data. Very, very efficient, and it does exactly what I've been trying to do.
Aaron Merriam
MTP ? do you mean MDP, http://mdp-toolkit.sourceforge.net ?
Denis
+2  A: 

Here are two answers:

B: The Wikipedia page indicates that math.floor() should be used on hashVal: this is how you obtain integers.
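A minimal sketch of that fix, mirroring the function from the question (the function name, default distribution parameters, and r value here are illustrative):

```python
import math
from random import normalvariate, uniform

def lsh_bucket(vector, r=1.0, a=None, b=None):
    # Draw the random projection vector and offset if not supplied.
    if a is None:
        a = [normalvariate(0, 1) for _ in range(len(vector))]
    if b is None:
        b = uniform(0, r)
    projection = sum(ai * vi for ai, vi in zip(a, vector)) + b
    # floor() turns the real-valued projection into an integer bucket index,
    # so two points can produce exactly equal hash values.
    return math.floor(projection / r)
```

Increasing r widens each bucket, making matches more likely at the cost of coarser neighborhoods.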

C: If you want to use the Hamming method, you can implement it quite simply: each Hamming hash function is defined by a coordinate (between 0 and 47) and a bit number (between 0 and 7). You can get the value of an integer i at a given bit b with:

bool(i & 2**b)
EOL
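The Hamming construction above can be sketched as a small hash family: each function picks one coordinate and one bit, and a tuple of such bits serves as a bucket key. The names and the choice of 16 bits per key here are illustrative:

```python
from random import randrange, seed

def make_hamming_hash(coord, bit):
    # One Hamming hash function: returns bit `bit` of coordinate `coord`.
    return lambda vector: bool(vector[coord] & (1 << bit))

seed(0)
# Concatenating several bits gives one bucket key for the LSH table.
hashes = [make_hamming_hash(randrange(48), randrange(8)) for _ in range(16)]

def bucket_key(vector):
    return tuple(h(vector) for h in hashes)
```

Points whose bucket keys collide across one or more such tables are candidate near neighbours under Hamming distance.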