views:

111

answers:

2

I'm looking for an answer that scales, but for my specific purpose, I have a 48th dimension vector. This could be represented as an array of 48 integers all between 0 and 255.

I have a large dictionary of these vectors, approximately 25 thousand of them.

I need to be able to take a vector that may or may not be in my database, and quickly find which vector from the database is closest. By closest, I mean in terms of traditional distance formula.

My code will end up in python but this is more a general question.

Brute Force is too slow. I need a near dictionary speed lookup. Anyone have an idea?

+8  A: 

I would suggest implementing a kd-tree on which you can perform Nearest neighbour search. The worst case search time for N points in k dimensions is O(k.N^(1-1/k)) so it should scale sublinearly in N.

If I have time I will come back to this answer and provide a less terse explanation that Wikipedia's.

Since you're working in python this Scipy cookbook entry on kdtrees should help.

Il-Bhima
Effectively quite terse, but at least the pointer seems spot on!
Matthieu M.
thanks for this btw. I did do a lot of looking into this and while kdtrees are pretty cool and I learned a lot, the LSH method mentioned below seems to be the most applicable solution due to the high dimensionality of my problem.
Aaron Merriam
+4  A: 

Another technique that will prove useful is locality sensitive hashing: http://en.wikipedia.org/wiki/Locality_sensitive_hashing

Its not clear from your question whether you need -exact- nearest neighbors. If you are happy with returning a vector that is approximately the nearest neighbor, there are faster solutions. See here (http://www.cs.umd.edu/~mount/ANN/)

Aaron
LSH seems to be best for me so far.http://www.mit.edu/~andoni/LSH/ has been a great resource. The 2006 paper on the algorithm has been the most helpful.
Aaron Merriam