views: 379 · answers: 5

I have a database with 500,000 points in a 100-dimensional space, and I want to find the two closest points. How do I do it?

Update: The space is Euclidean, sorry. And thanks for all the answers. BTW, this is not homework.

+8  A: 

There's a chapter in Introduction to Algorithms devoted to finding the two closest points in two-dimensional space in O(n log n) time. You can check it out on Google Books. In fact, I suggest it to everyone: the way they apply the divide-and-conquer technique to this problem is simple, elegant and impressive.

Although it can't be extended directly to your problem (the constant 7 would be replaced with 2^101 - 1), it should still be fine for most datasets. So, if you have reasonably random input, it will give you O(n log n · m) complexity, where n is the number of points and m is the number of dimensions.

Edit:
That's all assuming Euclidean space, i.e. the length of a vector v is sqrt(v0^2 + v1^2 + v2^2 + ...). If you can choose the metric, however, there could be other options to optimize the algorithm.
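For reference, here's a minimal Python sketch of the 2-D divide-and-conquer closest-pair algorithm described above (the structure and names are illustrative, not the book's code; it assumes at least two distinct points):

```python
import math

def closest_pair(points):
    """Return (distance, p, q) for the closest pair in O(n log n) time."""
    px = sorted(points)                       # points sorted by x
    py = sorted(points, key=lambda p: p[1])   # points sorted by y
    return _closest(px, py)

def _dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def _closest(px, py):
    n = len(px)
    if n <= 3:  # brute-force the base case
        return min(((_dist(p, q), p, q)
                    for i, p in enumerate(px) for q in px[i + 1:]),
                   key=lambda t: t[0])
    mid = n // 2
    midx = px[mid][0]
    lx, rx = px[:mid], px[mid:]
    lset = set(lx)
    ly = [p for p in py if p in lset]          # preserve y-order in each half
    ry = [p for p in py if p not in lset]
    best = min(_closest(lx, ly), _closest(rx, ry), key=lambda t: t[0])
    d = best[0]
    # Strip: points within d of the dividing line, scanned in y order.
    strip = [p for p in py if abs(p[0] - midx) < d]
    for i, p in enumerate(strip):
        for q in strip[i + 1:i + 8]:   # at most 7 successors matter in 2-D
            if q[1] - p[1] >= d:
                break
            cand = _dist(p, q)
            if cand < best[0]:
                best = (cand, p, q)
    return best
```

The "constant 7" mentioned above is the inner loop bound: in 2-D, a packing argument shows each strip point has at most 7 relevant successors; in 100 dimensions that bound blows up exponentially, which is why the method doesn't transfer directly.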

Nikita Rybak
+3  A: 

You could try the ANN library, but that only gives reliable results up to 20 dimensions.

dalle
Thanks. ANN is just what I was looking for. Hopefully it can hold everything in RAM.
louzer
ANN is easy to use, but it should be noted that it is an approximate nearest neighbor implementation, so isn't guaranteed to be correct.
chuck taylor
+4  A: 

Run PCA on your data to reduce the vectors from 100 dimensions to, say, 20. Then build a k-d tree and find the closest pair via nearest-neighbor queries under Euclidean distance.

Generally, if the number of dimensions is very large, you have to fall back on either a brute-force approach (parallel + distributed/MapReduce) or a clustering-based approach.
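A sketch of this pipeline, assuming numpy and scipy are available (note the PCA step makes the answer approximate: the closest pair in the reduced space need not be the closest pair in the original 100-d space):

```python
import numpy as np
from scipy.spatial import cKDTree

def closest_pair_pca(X, n_components=20):
    """X: (n_points, n_dims) array. Returns indices (i, j) of the
    (approximately) closest pair after projecting onto n_components."""
    Xc = X - X.mean(axis=0)                        # center the data
    # PCA via SVD: rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:n_components].T                   # reduced coordinates
    tree = cKDTree(Z)
    # k=2 because each point's nearest neighbor is itself at distance 0.
    dists, idx = tree.query(Z, k=2)
    i = int(np.argmin(dists[:, 1]))                # point with nearest "other"
    return i, int(idx[i, 1])
```

A reasonable follow-up is to take the best few candidate pairs from the reduced space and re-rank them by their true 100-d distances, which recovers exactness in practice for most data.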

Hasan Khan
Thanks. I am reducing the dimensions as per your suggestions.
louzer
+4  A: 

Use a k-d tree. You're looking at a nearest-neighbor problem, and there are highly optimized data structures for exactly this class of problems.

http://en.wikipedia.org/wiki/Kd-tree

P.S. Fun problem!

Stefan Mai
+5  A: 

Use the data structure known as a k-d tree. You'll need to allocate a lot of memory, but you may discover an optimization or two along the way based on your data.

http://en.wikipedia.org/wiki/Kd-tree

My friend was working on his PhD thesis years ago when he encountered a similar problem. His work was on the order of 1M points across 10 dimensions. We built a k-d tree library to solve it. We may be able to dig up the code if you want to contact us offline.

Here's his published paper: http://www.elec.qmul.ac.uk/people/josh/documents/ReissSelbieSandler-WIAMIS2003.pdf

selbie
k-d trees make it easy to find the nearest neighbor of a given point in O(log n) time, as I recall. Is there an optimization that finds the closest pair of points in less than O(n log n)?
rampion
-1. Also, according to Wikipedia, k-d trees are efficient only if N >> 2^k (where k is the number of dimensions and N the number of points). Here 2^100 >> 5e5, i.e. the condition fails spectacularly, so the answer is completely misleading.
Unreason
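Given that last point, a blocked brute-force scan is often the honest baseline at k = 100: it is exact, O(n² · d), but each block reduces to one matrix multiply. A numpy sketch (function name and block size are illustrative):

```python
import numpy as np

def closest_pair_bruteforce(X, block=1024):
    """Exact closest pair via blocked pairwise squared distances.
    O(n^2 * d) work, but vectorized: the practical fallback when
    k-d trees stop helping in high dimensions."""
    n = X.shape[0]
    sq = np.einsum('ij,ij->i', X, X)          # squared norm of each row
    best_d2, best_i, best_j = np.inf, -1, -1
    for a in range(0, n, block):
        Xa, sa = X[a:a + block], sq[a:a + block]
        for b in range(a, n, block):          # b >= a covers each pair once
            Xb, sb = X[b:b + block], sq[b:b + block]
            # |x - y|^2 = |x|^2 - 2 x.y + |y|^2, computed blockwise
            d2 = sa[:, None] - 2.0 * (Xa @ Xb.T) + sb[None, :]
            if a == b:
                np.fill_diagonal(d2, np.inf)  # exclude self-pairs
            k = int(np.argmin(d2))
            i, j = divmod(k, d2.shape[1])
            if d2[i, j] < best_d2:
                best_d2, best_i, best_j = d2[i, j], a + i, b + j
    # max(..., 0) guards against tiny negative values from cancellation
    return float(np.sqrt(max(best_d2, 0.0))), best_i, best_j
```

At 500,000 points this is about 1.25e11 pairwise distances; with 100-d vectors that is roughly 2.5e13 floating-point operations, which a modern machine (or a small MapReduce job, as suggested above) can grind through, and it sidesteps the curse of dimensionality entirely.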