How to find the closest 2 points in a 100 dimensional space with 500,000 points?

views:

379

answers:

+9 Q:

How to find the closest 2 points in a 100 dimensional space with 500,000 points?

I have a database with 500,000 points in a 100 dimensional space, and I want to find the closest 2 points. How do I do it?

Update: Space is Euclidean, Sorry. And thanks for all the answers. BTW this is not homework.

+8 A:

There's a chapter in Introduction to Algorithms devoted to finding two closest points in two-dimensional space in O(n*logn) time. You can check it out on google books. In fact, I suggest it for everyone as the way they apply divide-and-conquer technique to this problem is very simple, elegant and impressive.

Although it can't be extended directly to your problem (as constant 7 would be replaced with 2^101 - 1), it should be just fine for most datasets. So, if you have reasonably random input, it will give you O(n*logn*m) complexity where n is the number of points and m is the number of dimensions.

edit
That's all assuming you have Euclidian space. I.e., length of vector v is sqrt(v0^2 + v1^2 + v2^2 + ...). If you can choose metric, however, there could be other options to optimize the algorithm.

Nikita Rybak 2010-10-10 05:39:57

+3 A:

You could try the ANN library, but that only gives reliable results up to 20 dimensions.

dalle 2010-10-10 05:50:21

Thanks. ANN is just what I was looking for. Hopefully it can hold everything in RAM.

louzer 2010-10-11 14:36:16

ANN is easy to use, but it should be noted that it is an approximate nearest neighbor implementation, so isn't guaranteed to be correct.

chuck taylor 2010-10-12 19:36:14

+4 A:

Run PCA on your data to convert vectors from 100 dimensions to say 20 dimensions. Then create a K-Nearest Neighbor tree (KD-Tree) and get the closest 2 neighbors based on euclidean distance.

Generally if no. of dimensions are very large then you have to either do a brute force approach (parallel + distributed/map reduce) or a clustering based approach.

Hasan Khan 2010-10-10 06:00:07

Thanks. I am reducing the dimensions as per your suggestions.

louzer 2010-10-11 14:35:48

+4 A:

Use a kd tree. You're looking at a nearest neighbor problem and there are highly optimized data structures for handling this exact class of problems.

http://en.wikipedia.org/wiki/Kd-tree

P.S. Fun problem!

Stefan Mai 2010-10-10 06:05:16

+5 A:

Use the data structure known as a KD-TREE. You'll need to allocate a lot of memory, but you may discover an optimization or two along the way based on your data.

http://en.wikipedia.org/wiki/Kd-tree.

My friend was working on his Phd Thesis years ago when he encountered a similar problem. His work was on the order of 1M points across 10 dimensions. We built a kd-tree library to solve it. We may be able to dig-up the code if you want to contact us offline.

Here's his published paper: http://www.elec.qmul.ac.uk/people/josh/documents/ReissSelbieSandler-WIAMIS2003.pdf

selbie 2010-10-10 06:19:02

kdtrees make it easy to find a nearest neighbor to a given point in O(log n) time, as I remember. Is there an optimization to find the closest pair of points in less than O(n log n)?

rampion 2010-10-10 14:46:34

-1, also according to wikipedia kD-tree are efficient if N >> 2^k (where k is dimensions and N number of points; in this case 2^100 >> 5e5 and the answer is completely misleading)

Unreason 2010-10-12 13:54:49

ansaurus

tags:

views:

answers:

How to find the closest 2 points in a 100 dimensional space with 500,000 points?

related questions