Fast lookup for dictionary vector to a given vector. High dimensions

views:

111

answers:

+5 Q:

Fast lookup for dictionary vector to a given vector. High dimensions

I'm looking for an answer that scales, but for my specific purpose, I have a 48th dimension vector. This could be represented as an array of 48 integers all between 0 and 255.

I have a large dictionary of these vectors, approximately 25 thousand of them.

I need to be able to take a vector that may or may not be in my database, and quickly find which vector from the database is closest. By closest, I mean in terms of traditional distance formula.

My code will end up in python but this is more a general question.

Brute Force is too slow. I need a near dictionary speed lookup. Anyone have an idea?

+8 A:

I would suggest implementing a kd-tree on which you can perform Nearest neighbour search. The worst case search time for N points in k dimensions is O(k.N^(1-1/k)) so it should scale sublinearly in N.

If I have time I will come back to this answer and provide a less terse explanation that Wikipedia's.

Since you're working in python this Scipy cookbook entry on kdtrees should help.

Il-Bhima 2010-07-02 07:49:37

Effectively quite terse, but at least the pointer seems spot on!

Matthieu M. 2010-07-02 12:02:54

thanks for this btw. I did do a lot of looking into this and while kdtrees are pretty cool and I learned a lot, the LSH method mentioned below seems to be the most applicable solution due to the high dimensionality of my problem.

Aaron Merriam 2010-07-17 18:39:47

+4 A:

Another technique that will prove useful is locality sensitive hashing: http://en.wikipedia.org/wiki/Locality_sensitive_hashing

Its not clear from your question whether you need -exact- nearest neighbors. If you are happy with returning a vector that is approximately the nearest neighbor, there are faster solutions. See here (http://www.cs.umd.edu/~mount/ANN/)

Aaron 2010-07-02 17:18:03

LSH seems to be best for me so far.http://www.mit.edu/~andoni/LSH/ has been a great resource. The 2006 paper on the algorithm has been the most helpful.

Aaron Merriam 2010-07-17 18:40:50

ansaurus

tags:

views:

answers:

Fast lookup for dictionary vector to a given vector. High dimensions

related questions