List1 contains a large number (~7^10) of N-dimensional points (N <= 10); List2 contains the same number of N-dimensional points or fewer (N <= 10).

My task is this: for every point in List1, I want to find the point in List2 that is closest to it (by Euclidean distance) and then perform some operation on it. I had been doing it the simple way, with a nested loop, when List1 had no more than 50 points, but with 7^10 points this obviously takes a lot of time.

What is the fastest way to do this? Might any concepts from computational geometry help?

EDIT: Here is what I have in place: I have built a kd-tree out of List2, and I am now doing a nearest-neighbour search in it for each point in List1. As I originally pointed out, List1 has 7^10 points, so although I am saving the brute-force Euclidean distance computation for every pair, the sheer number of points in List1 still consumes a lot of time. Is there any way I can improve on this?

+2  A: 
Charles Stewart
Even if the problem were to find the closest pair of points overall (which it isn't -- he actually wants the closest partner for *every* point in L1), the closest pair need not be on either hull. Some point near the centre of L1 could be closer to some point near the centre of L2 than any hull-point of L1 is to any hull-point of L2.
j_random_hacker
@j_random: I've edited the post to reflect your comment, and describe a revised algorithm.
Charles Stewart
Charles: I am sorry that I don't understand much of Computational geometry, hence the convex hull approach may not be something I can quickly use. Thanks a lot for your answer.
Amit
@Charles: Kudos for the update. But... :) If the volumes of the hulls are disjoint, the nearest point to any point on L1's hull must be somewhere on L2's hull, but that point might be somewhere along a line segment rather than at a corner point defined by an actual point from L2. (Much easier to show this on a diagram I'm afraid.) Also not convinced yet that lines through the centres of mass have the property you ascribe (though it may be a good heuristic) -- you can move the centre of mass anywhere by adding many points near the boundary. (Sorry to be picky... :))
j_random_hacker
+6  A: 

A good way would be to use something like a kd-tree and perform nearest-neighbour searches. Fortunately, you do not have to implement this data structure yourself; it has been done before. I recommend this one, but there are others:

http://www.cs.umd.edu/~mount/ANN/

PeterK
Awesome. FTW. Checking :)
Amit
Since my existing code is in Python, I am not going to use this library. I am looking to use http://gamera.sourceforge.net/doc/html/kdtree.html. But, the most important thing is the concept of using kd-trees for the same. Thanks!
Amit
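A minimal sketch of this approach in Python. It uses SciPy's `cKDTree` rather than the Gamera kdtree linked above (an assumption; any kd-tree library with a bulk query works the same way), with random stand-in data:

```python
# Hypothetical sketch: build a kd-tree over List2 once, then query the
# nearest List2 neighbour for every List1 point in one vectorized call.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
list1 = rng.random((10_000, 10))   # stand-in for the real ~7^10 points
list2 = rng.random((5_000, 10))

tree = cKDTree(list2)              # build once over List2
dists, idx = tree.query(list1, k=1)
nearest = list2[idx]               # closest List2 point for each List1 point
```

Building the tree is done once, so the per-query cost dominates; the vectorized `query` call avoids Python-level loop overhead on top of the brute-force savings.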
PeterK: Sorry for toggling the "Accept". Can you please look at the edited question now?
Amit
I think we are running out of options. There are two possibilities, I think. First, you could use some kind of approximation while searching for nearest neighbours; the ANN library mentioned above allows this, though I don't know about the others. Second, you could preprocess List1 in some way, such as computing "clusters" of some size. By clusters I mean groups of points which are very similar, so that one of them can be picked as a representative of the whole cluster. If you then run an NN search for more than one neighbour, you might be able to get good results (though it still depends on what exactly you want to do).
PeterK
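Both ideas can be sketched with SciPy's `cKDTree` (an assumption about the library; the grid-cell snapping below is only one crude way to pick cluster representatives):

```python
# Sketch of the two ideas above: approximate NN search via the `eps`
# slack parameter, and thinning List1 so one representative is queried
# per cluster of near-identical points.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)
list1 = rng.random((5_000, 10))
list2 = rng.random((2_000, 10))
tree = cKDTree(list2)

# 1) Approximate search: the returned neighbour is guaranteed to be
#    within a factor (1 + eps) of the true nearest distance.
d_approx, i_approx = tree.query(list1, k=1, eps=0.1)

# 2) Thin List1 first: snap points to a coarse grid, keep one
#    representative per occupied cell, then query only representatives.
cell = 0.25
keys = np.floor(list1 / cell).astype(np.int64)
_, rep_rows = np.unique(keys, axis=0, return_index=True)
reps = list1[rep_rows]
d_rep, i_rep = tree.query(reps, k=1)
```

How much the thinning helps depends entirely on how clustered List1 really is and on whether a per-cluster answer is acceptable for the subsequent operation.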
Oh, one more thing: you could go the parallel way and run several searches at a time. Of course, this is only reasonable if you have a multicore processor and your system allows threading. Moreover, you need the kd-tree to be thread-safe, at least when it comes to searching.
PeterK
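A sketch of the parallel idea, again assuming SciPy: `cKDTree.query` accepts a `workers` argument (SciPy >= 1.6; the version requirement is an assumption about your environment) that spreads the query points over CPU cores. Searching is read-only on the tree, so concurrent queries are safe:

```python
# Parallel bulk query: `workers=-1` uses all available cores.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(2)
list1 = rng.random((20_000, 10))
list2 = rng.random((5_000, 10))

tree = cKDTree(list2)
dists, idx = tree.query(list1, k=1, workers=-1)
```

If you roll your own tree instead, the same trick applies: partition List1 into chunks and search each chunk in its own thread, sharing the read-only tree.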
+1  A: 

A kd-tree is pretty fast. I've used the algorithm in this paper and it works well: Bentley, "K-d Trees for Semidynamic Point Sets".

I'm sure there are libraries around, but it's nice to know what's going on sometimes - Bentley explains it well.

Basically, there are a number of ways to search a tree: the nearest N neighbours, all neighbours within a given radius, or the nearest N neighbours within a radius. Sometimes you want to search for bounded objects.

The idea is that the kd-tree partitions the space recursively. Each node splits its region in two along one of the dimensions of the space you are in, ideally perpendicular to the node's longest dimension. You keep splitting until you have about 4 points in each bucket.

Then, for every query point, as you recursively visit nodes, you check the distance from the query point to the partition wall of the node you are in. If that distance is less than the search radius, you descend both children (the one the query point lies in and its sibling); if the wall is beyond the radius, you search only the child the query point lies in.

When you get to a bucket (leaf node), you test the points in there to see if they are within the radius.

If you want the closest point, you can start with a massive radius and pass a pointer or reference to it as you recurse; that way you can shrink the search radius as you find close points, and home in on the closest point pretty fast.
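The whole scheme above can be condensed into a short Python sketch. The leaf size, dict-based nodes, and `math.dist` are implementation choices of this sketch, not details from Bentley's paper:

```python
# Minimal kd-tree: split on the longest dimension down to small buckets,
# then search recursively, shrinking the radius as closer points appear.
import math

def build(points, leaf_size=4):
    if len(points) <= leaf_size:
        return {"bucket": points}
    ndim = len(points[0])
    # Split perpendicular to this node's longest dimension.
    spreads = [max(p[d] for p in points) - min(p[d] for p in points)
               for d in range(ndim)]
    axis = spreads.index(max(spreads))
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"axis": axis,
            "wall": points[mid][axis],
            "left": build(points[:mid], leaf_size),
            "right": build(points[mid:], leaf_size)}

def nearest(node, q, best=None, best_d=math.inf):
    if "bucket" in node:                      # leaf: test every point
        for p in node["bucket"]:
            d = math.dist(q, p)
            if d < best_d:
                best, best_d = p, d
        return best, best_d
    axis, wall = node["axis"], node["wall"]
    if q[axis] < wall:
        near, far = node["left"], node["right"]
    else:
        near, far = node["right"], node["left"]
    best, best_d = nearest(near, q, best, best_d)
    # Descend the sibling only if the wall is inside the current radius.
    if abs(q[axis] - wall) < best_d:
        best, best_d = nearest(far, q, best, best_d)
    return best, best_d

# Small demo with deterministic pseudo-random 3-D points.
pts = [(i * 0.37 % 1.0, i * 0.71 % 1.0, i * 0.13 % 1.0) for i in range(200)]
tree = build(pts)
q = (0.5, 0.5, 0.5)
best, best_d = nearest(tree, q)
```

Starting `best_d` at infinity plays the role of the "massive radius"; each closer point found tightens the pruning test on the wall distance.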

Actually, you don't really need to split along the longest dimension; one technique which works pretty well is to split on each dimension alternately, always in the same order. It's perhaps not the most efficient, but it's predictable, and that has some advantages too.
Matthieu M.
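Matthieu's variant is cheap to state in code: the split axis depends only on the depth of the node, not on the data (a tiny sketch with a hypothetical helper name):

```python
# Cycling split axis: the depth alone determines the dimension,
# so no spread computation is needed at build time.
def split_axis(depth, ndim):
    return depth % ndim
```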
Thanks! I think I'll try that. I guess it will speed up the build, because figuring out which is the longest dimension is probably expensive, even with the optimization in Bentley's paper (using a subset of the points to measure)?
Figuring out which dimension has the biggest spread is really expensive. Introducing an approximation (checking only a subset of the points) gives a considerable speedup, depending of course on the size of the subset. In my experience it is entirely worth it.
PeterK
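The approximation PeterK describes can be sketched as follows. The sample size and function name are assumptions for illustration; Bentley's paper discusses the trade-off:

```python
# Estimate the dimension of largest spread from a random sample of the
# points instead of scanning all of them.
import random

def widest_dim_approx(points, sample_size=100, seed=0):
    rng = random.Random(seed)
    sample = (points if len(points) <= sample_size
              else rng.sample(points, sample_size))
    ndim = len(sample[0])
    spreads = [max(p[d] for p in sample) - min(p[d] for p in sample)
               for d in range(ndim)]
    return spreads.index(max(spreads))
```

With a fixed sample size, the per-node cost becomes independent of the node's point count, which is where the build-time speedup comes from.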