I have a dataset of approximately 100,000 (X, Y) pairs representing points in 2D space. For each point, I want to find its k-nearest neighbors.

So, my question is: what data structure / algorithm would be a suitable choice, assuming I want to absolutely minimise the overall running time?

I'm not looking for code - just a pointer towards a suitable approach. I'm a bit daunted by the range of choices that seem relevant - quad-trees, R-trees, kd-trees, etc.

I'm thinking the best approach is to build a data structure, then run some kind of k-Nearest Neighbor search for each point. However, since (a) I know the points in advance, and (b) I know I must run the search for every point exactly once, perhaps there is a better approach?

Some extra details:

  • Since I want to minimise the entire running time, I don't care whether the majority of the time is spent building the structure or running the searches.
  • The (X, Y) pairs are fairly well spread out, so we can assume an almost uniform distribution.
+2  A: 

If k is relatively small (<20 or so) and you have an approximately uniform distribution, create a grid that overlays the range where the points fall, with the cell size chosen so that the average number of points per cell is comfortably higher than k (so that a centrally-located point will usually find its k neighbors within that one cell). Then create a set of additional grids, each offset from the first by half a cell (overlapping) along each axis. Now for each point, compute which cell it falls into (since the grids are regular, no searching is required) and, of the four (or however many) overlapping grids, pick the one in which the point lies closest to its cell's center.
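
For concreteness, a minimal sketch of the bucketing step in Python; the ~4k-points-per-cell target and the helper names are illustrative, not prescriptive:

    import math
    from collections import defaultdict

    def build_grids(points, k,
                    offsets=((0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5))):
        # Bucket points into four regular grids, each offset half a cell
        # from the first. The cell size targets ~4*k points per cell on
        # average (comfortably above k); tune the factor of 4 on real data.
        x0 = min(x for x, _ in points)
        y0 = min(y for _, y in points)
        x1 = max(x for x, _ in points)
        y1 = max(y for _, y in points)
        cell = math.sqrt((x1 - x0) * (y1 - y0) * 4 * k / len(points))

        grids = []
        for ox, oy in offsets:
            grid = defaultdict(list)
            for x, y in points:
                # Regular grid: the cell index is a direct computation,
                # no searching required.
                key = (int((x - x0) / cell + ox), int((y - y0) / cell + oy))
                grid[key].append((x, y))
            grids.append(grid)
        return grids, cell, (x0, y0)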

Within each grid cell, the points should be sorted by one coordinate (say x). Starting from the query point's position in the sorted list (found by bisection), walk outwards in both directions, collecting candidates (again, if k is small, the fastest way to maintain the list of the k best hits is binary insertion, letting the worst match fall off the end on each insert; insertion sort generally beats everything else up to about 30 items on modern hardware). Keep going until the x-distance alone to the next unexamined point (i.e. ignoring its y-offset) exceeds the distance to your kth-closest hit so far, since then no remaining point in the list can be closer.
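
A minimal sketch of that bookkeeping (the function name and calling convention are illustrative); it stores plain Euclidean distances so the result can also drive the wall test described next:

    import bisect

    def update_k_best(best, dist, point, k):
        # Maintain the k closest hits as a list sorted by distance.
        # Binary insertion keeps it ordered; once the list holds k
        # entries, the worst match falls off the end on each insert.
        if len(best) < k or dist < best[-1][0]:
            bisect.insort(best, (dist, point))
            if len(best) > k:
                best.pop()  # drop the current worst match
        return best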

If you do not have k points yet, or you have k points but one or more walls of the grid cell are closer to your point of interest than the farthest of those k points, add the relevant adjacent grid cells into the search.
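
The wall test itself is just coordinate arithmetic; a sketch under the assumption of an unshifted grid with origin (x0, y0) and cell size cell (names are illustrative):

    def cells_to_expand(px, py, i, j, cell, x0, y0, worst_dist):
        # Yield adjacent cells whose shared wall is closer to (px, py)
        # than the farthest of the current k best matches. worst_dist
        # must be a plain (not squared) distance. When two walls both
        # qualify, the corner-adjacent cell may need checking as well.
        if px - (x0 + i * cell) < worst_dist:
            yield (i - 1, j)
        if (x0 + (i + 1) * cell) - px < worst_dist:
            yield (i + 1, j)
        if py - (y0 + j * cell) < worst_dist:
            yield (i, j - 1)
        if (y0 + (j + 1) * cell) - py < worst_dist:
            yield (i, j + 1)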

This should give you performance of something like O(N*k^2), with a relatively low constant factor. If k is large, this strategy is too simplistic, and you should choose an algorithm that is linear or log-linear in k, such as a kd-tree.
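
For larger k, a library kd-tree is the easy baseline; e.g. with SciPy (querying k+1 neighbors because each point's nearest hit is itself):

    import numpy as np
    from scipy.spatial import cKDTree

    pts = np.random.rand(100_000, 2)  # stand-in for the real (X, Y) pairs
    k = 10

    tree = cKDTree(pts)
    # k+1 because the closest match to each query point is the point itself.
    dists, idxs = tree.query(pts, k=k + 1)
    neighbors = idxs[:, 1:]  # drop the self-match

That one call handles both the structure building and all N queries.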

Rex Kerr
Thanks very much, this sounds like a very good approach.
Bio.X2Y
The "grid bucket" concept is very helpful. But it's not so obvious whether or how much the overlapping multiple grids will speed things up. @Bio may want to code a few different configurations of overlapping grids and compare performance on some sample data. I would include a single-grid algorithm, @Rex's four(?)-grid algorithm, and probably a two-grid algorithm in which each grid's vertices are at the other's bucket centers.
aschepler
@aschepler - Agreed, and there's also a balance between the number of separate grids and the size of each grid cell. Four grids guarantee that you'll always be at least (d/4) away from a wall. A single finely-spaced grid with a good pruning algorithm would probably be fastest; it's just more complicated to implement. Keeping the points sorted along more than one axis would probably help even more than having overlapping grids, but again, that's yet more complicated.
Rex Kerr