I have a system that stores vectors and lets a user find the n vectors most similar to a query vector. That is, a user submits a vector (I call it a query vector) and my system returns the n most similar stored vectors. I find the similar vectors using a KD-Tree and everything works well, but I want to do more: I want to present the n most similar vectors even when the user submits an incomplete query vector (a vector with missing values). That is, if a user submits a vector with only three dimensions, I still want to find the n nearest of my stored vectors, which have 11 dimensions.

I have a couple of obvious solutions, but I'm not sure either one seems very good:

  1. Create multiple KD-Trees, each built from one of the most popular subsets of dimensions users search for. That is, if a user submits a query vector of three dimensions, x, y, z, I match that query against an already-built KD-Tree that only contains vectors of those three dimensions, x, y, z.

  2. Ignore KD-Trees when a user submits a query vector with missing values and compare the query vector to the vectors (stored in a table in a DB) one by one using something like a dot product.

This has to be a common problem, any suggestions? Thanks for the help.

A: 

Your second option looks like a reasonable solution for what you want.

You could also populate the missing dimensions with the most important (or average, or whatever you think appropriate) values, if there are any.
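A minimal sketch of the imputation idea, assuming missing query dimensions are marked with NaN and filled with the per-dimension mean of the stored vectors before running the normal KD-tree lookup (all names here are illustrative, not from the question's actual system):

```java
public class ImputeQuery {
    // Replace each NaN entry of the query with the mean of that
    // dimension across all stored vectors.
    static double[] imputeWithMeans(double[] query, double[][] stored) {
        double[] filled = query.clone();
        for (int d = 0; d < filled.length; d++) {
            if (Double.isNaN(filled[d])) {
                double sum = 0;
                for (double[] v : stored) sum += v[d];
                filled[d] = sum / stored.length;
            }
        }
        return filled;
    }

    public static void main(String[] args) {
        double[][] stored = {{1, 2, 3}, {3, 4, 5}};
        double[] query = {0, Double.NaN, Double.NaN};
        double[] filled = imputeWithMeans(query, stored);
        // Missing dimensions become the column means: 3.0 and 4.0
        System.out.println(filled[1] + " " + filled[2]);
    }
}
```

The completed query can then be fed to the existing 11-dimensional KD-tree unchanged, at the cost of biasing results toward "average" values in the unspecified dimensions.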

leiz
+1  A: 

Your first solution might be fastest for queries (since the tree-building doesn't consider splits in directions that you don't care about), but it would definitely use a lot of memory. And if you have to rebuild the trees repeatedly, it could get slow.

The second option looks very slow unless you only have a few points. And if that's the case, you probably didn't need a kd-tree in the first place :)

I think the best solution involves getting your hands dirty in the code that you're working with. Presumably the nearest-neighbor search computes the distance between the point in the tree leaf and the query vector; you should be able to modify this to handle the case where the point and the query vector are different sizes. E.g. if the points in the tree are given in 3D, but your query vector is only length 2, then the "distance" between the point (p0, p1, p2) and the query vector (x0, x1) would be

sqrt( (p0-x0)^2 + (p1-x1)^2 )

I didn't dig into the java code that you linked to, but I can try to find exactly where the change would need to go if you need help.

-Chris

PS - you might not need the sqrt in the equation above, since distance squared is usually equivalent.
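The modified distance described above can be sketched as follows, comparing only the first `query.length` dimensions when the query is shorter than the stored point, and using squared distance since (per the PS) the sqrt is unnecessary for ranking:

```java
public class PartialDistance {
    // Squared Euclidean distance over only the dimensions the query
    // supplies; extra dimensions of the stored point are ignored.
    static double partialSquaredDistance(double[] point, double[] query) {
        double sum = 0;
        for (int d = 0; d < query.length; d++) {
            double diff = point[d] - query[d];
            sum += diff * diff;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] point = {3, 4, 99};   // third dimension is ignored
        double[] query = {0, 0};
        System.out.println(partialSquaredDistance(point, query)); // 25.0
    }
}
```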

EDIT: Sorry, I didn't realize it would be this easy to find in the source code. You should use this version of the neighbor function:

nearest(double [] key, int n, Checker<T> checker)

And implement your own Checker class; see their EuclideanDistance.java for the Euclidean version. You may also need to comment out any KeySizeException that the query code throws, since you know you can handle differently sized keys.

celion
A: 

You could try using the existing KD tree -- by taking both branches when the split is for a dimension that is not supplied by the source vector. This should take less time than doing a brute force search, and might be less trouble than trying to maintain a bunch of specialized trees for dimension subsets.

You would need to adapt your N-closest algorithm (without more info I can't advise you on that...), and for distance you would use the sum of the squares of only those elements supplied by the source vector.
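The two ideas above (descend both branches at an unsupplied split dimension, score on supplied dimensions only) can be sketched with a toy KD-tree. This is a hedged illustration, not the question's actual code: missing query dimensions are marked NaN, and all class and method names are made up for the example.

```java
import java.util.Arrays;

public class WildcardKdTree {
    static class Node {
        double[] point;
        int axis;
        Node left, right;
    }

    // Standard median-split kd-tree construction.
    static Node build(double[][] pts, int depth) {
        if (pts.length == 0) return null;
        int axis = depth % pts[0].length;
        Arrays.sort(pts, (a, b) -> Double.compare(a[axis], b[axis]));
        int mid = pts.length / 2;
        Node n = new Node();
        n.point = pts[mid];
        n.axis = axis;
        n.left = build(Arrays.copyOfRange(pts, 0, mid), depth + 1);
        n.right = build(Arrays.copyOfRange(pts, mid + 1, pts.length), depth + 1);
        return n;
    }

    // Sum of squares over only the supplied (non-NaN) query dimensions.
    static double dist(double[] p, double[] q) {
        double s = 0;
        for (int d = 0; d < q.length; d++)
            if (!Double.isNaN(q[d])) { double diff = p[d] - q[d]; s += diff * diff; }
        return s;
    }

    static double[] best;
    static double bestDist;

    static double[] nearest(Node root, double[] query) {
        best = null;
        bestDist = Double.POSITIVE_INFINITY;
        search(root, query);
        return best;
    }

    static void search(Node n, double[] q) {
        if (n == null) return;
        double d = dist(n.point, q);
        if (d < bestDist) { bestDist = d; best = n.point; }
        if (Double.isNaN(q[n.axis])) {
            // Split dimension not supplied by the query: take BOTH branches.
            search(n.left, q);
            search(n.right, q);
        } else {
            double diff = q[n.axis] - n.point[n.axis];
            Node near = diff < 0 ? n.left : n.right;
            Node far  = diff < 0 ? n.right : n.left;
            search(near, q);
            if (diff * diff < bestDist) search(far, q);  // usual pruning test
        }
    }

    public static void main(String[] args) {
        double[][] pts = {{1, 1, 1}, {5, 5, 5}, {9, 1, 9}, {2, 8, 2}};
        Node root = build(pts, 0);
        // Query supplies dimensions 0 and 2 only; dimension 1 is a wildcard.
        double[] q = {9, Double.NaN, 9};
        System.out.println(Arrays.toString(nearest(root, q))); // [9.0, 1.0, 9.0]
    }
}
```

Extending this from nearest-1 to nearest-n means replacing the single `best` with a bounded max-heap of the n best candidates, which is the usual adaptation.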

comingstorm
A: 

Here's what I ended up doing: when a user didn't specify a value (when their query vector lacked a dimension), I simply adjusted my matching range (in the API) for that dimension to something huge, so that any value matches.

labratmatt