views:

25

answers:

1

Hi all,

I want to use scipy.spatial's KDTree to find nearest neighbor pairs in a two dimensional array (essentially a list of lists where the dimension of the nested list is 2). I generate my list of lists, pipe it into numpy's array and then create the KDTree instance. However, whenever I try to run "query" on it, I inevitably get weird answers. For example, when I type:

tree = KDTree(array)
nearest = tree.query(np.array[1,1])

nearest prints out (0.0, 0). Currently, I'm using an an array that is basically y = x for the range (1,50) so I expect that I should get the nearest neighbor of (2,2) for (1,1)

What am I doing wrong, scipy gurus?

EDIT: Alternatively, if someone can point me to a KDTree package for python that they've used for nearest neighbor searches of a given point I would love to hear about it.

+2  A: 

I have used scipy.spatial before, and it appears to be a nice improvement (especially wrt the interface) as compared to scikits.ann.

In this case I think you have confused the return from your tree.query(...) call. From the scipy.spatial.KDTree.query docs:

Returns
-------

d : array of floats
    The distances to the nearest neighbors.
    If x has shape tuple+(self.m,), then d has shape tuple if
    k is one, or tuple+(k,) if k is larger than one.  Missing
    neighbors are indicated with infinite distances.  If k is None,
    then d is an object array of shape tuple, containing lists
    of distances. In either case the hits are sorted by distance
    (nearest first).
i : array of integers
    The locations of the neighbors in self.data. i is the same
    shape as d.

So in this case when you query for the nearest to [1,1] you are getting:

distance to nearest: 0.0
index of nearest in original array: 0

This means that [1,1] is the first row of your original data in array, which is expected given your data is y = x on the range [1,50].

The scipy.spatial.KDTree.query function has lots of other options, so if for example you wanted to make sure to get the nearest neighbour that isn't itself try:

tree.query([1,1], k=2)

This will return the two nearest neighbours, which you could apply further logic to such that cases where the distance returned is zero (i.e. the point queried is one of data items used to build the tree) the second nearest neighbour is taken rather than the first.

dtlussier
Thanks kindly. Makes a lot more sense now!
jlv