I have a system that stores vectors and lets a user find the n vectors most similar to a query vector. That is, a user submits a vector (I call it a query vector) and my system returns the n most similar stored vectors. I find the similar vectors using a KD-Tree and everything works well, but I want to do more: I want to present the n most similar vectors even when the user submits an incomplete query vector (a vector with missing values). That is, if a user submits a vector with only three dimensions, I still want to find the n nearest of my stored vectors, which have 11 dimensions.

I have a couple of obvious solutions, but I'm not sure either one seems very good:

  1. Create multiple KD-Trees, each built from one of the most popular subsets of dimensions users search for. That is, if a user submits a query vector of three dimensions, x, y, z, I match that query against an already-built KD-Tree that only contains vectors of those three dimensions, x, y, z.

  2. Ignore KD-Trees when a user submits a query vector with missing values and compare the query vector to the vectors (stored in a table in a DB) one by one using something like a dot product.

This has to be a common problem, any suggestions? Thanks for the help.

A: 

Your second option looks like a reasonable solution for what you want.

You could also populate the missing dimensions with the most important (or average, or whatever you think appropriate) values, if there are any.
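A minimal sketch of the imputation idea, assuming missing query dimensions are marked with NaN and filled with the per-dimension mean of the stored vectors before running the normal KD-tree lookup (all names here are illustrative, not from the question's actual system):

```java
public class ImputeQuery {
    // Replace each NaN entry of the query with the mean of that
    // dimension across all stored vectors.
    static double[] imputeWithMeans(double[] query, double[][] stored) {
        double[] filled = query.clone();
        for (int d = 0; d < filled.length; d++) {
            if (Double.isNaN(filled[d])) {
                double sum = 0;
                for (double[] v : stored) sum += v[d];
                filled[d] = sum / stored.length;
            }
        }
        return filled;
    }

    public static void main(String[] args) {
        double[][] stored = {{1, 2, 3}, {3, 4, 5}};
        double[] query = {0, Double.NaN, Double.NaN};
        double[] filled = imputeWithMeans(query, stored);
        // Missing dimensions become the column means: 3.0 and 4.0
        System.out.println(filled[1] + " " + filled[2]);
    }
}
```

The completed query can then be fed to the existing 11-dimensional KD-tree unchanged, at the cost of biasing results toward "average" values in the unspecified dimensions.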

leiz
+1  A: 

Your first solution might be fastest for queries (since the tree-building doesn't consider splits in directions that you don't care about), but it would definitely use a lot of memory. And if you have to rebuild the trees repeatedly, it could get slow.

The second option looks very slow unless you only have a few points. And if that's the case, you probably didn't need a kd-tree in the first place :)

I think the best solution involves getting your hands dirty in the code that you're working with. Presumably the nearest-neighbor search computes the distance between the point in the tree leaf and the query vector; you should be able to modify this to handle the case where the point and the query vector are different sizes. E.g. if the points in the tree are given in 3D, but your query vector is only length 2, then the "distance" between the point (p0, p1, p2) and the query vector (x0, x1) would be

sqrt( (p0-x0)^2 + (p1-x1)^2 )

I didn't dig into the java code that you linked to, but I can try to find exactly where the change would need to go if you need help.

-Chris

PS - you might not need the sqrt in the equation above, since distance squared is usually equivalent.
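The modified distance described above can be sketched as follows, comparing only the first `query.length` dimensions when the query is shorter than the stored point, and using squared distance since (per the PS) the sqrt is unnecessary for ranking:

```java
public class PartialDistance {
    // Squared Euclidean distance over only the dimensions the query
    // supplies; extra dimensions of the stored point are ignored.
    static double partialSquaredDistance(double[] point, double[] query) {
        double sum = 0;
        for (int d = 0; d < query.length; d++) {
            double diff = point[d] - query[d];
            sum += diff * diff;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] point = {3, 4, 99};   // third dimension is ignored
        double[] query = {0, 0};
        System.out.println(partialSquaredDistance(point, query)); // 25.0
    }
}
```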

EDIT: Sorry, I didn't realize it would be this easy to find in the source code. You should use this version of the neighbor function:

nearest(double [] key, int n, Checker<T> checker)

And implement your own Checker class; see their EuclideanDistance.java for the Euclidean version. You may also need to comment out any KeySizeException that the query code throws, since you know you can handle differently sized keys.

celion
A: 

You could try using the existing KD tree -- by taking both branches when the split is for a dimension that is not supplied by the source vector. This should take less time than doing a brute force search, and might be less trouble than trying to maintain a bunch of specialized trees for dimension subsets.

You would need to adapt your N-closest algorithm (without more info I can't advise you on that...), and for distance you would use the sum of the squares of only those elements supplied by the source vector.
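The two ideas above (descend both branches at an unsupplied split dimension, score on supplied dimensions only) can be sketched with a toy KD-tree. This is a hedged illustration, not the question's actual code: missing query dimensions are marked NaN, and all class and method names are made up for the example.

```java
import java.util.Arrays;

public class WildcardKdTree {
    static class Node {
        double[] point;
        int axis;
        Node left, right;
    }

    // Standard median-split kd-tree construction.
    static Node build(double[][] pts, int depth) {
        if (pts.length == 0) return null;
        int axis = depth % pts[0].length;
        Arrays.sort(pts, (a, b) -> Double.compare(a[axis], b[axis]));
        int mid = pts.length / 2;
        Node n = new Node();
        n.point = pts[mid];
        n.axis = axis;
        n.left = build(Arrays.copyOfRange(pts, 0, mid), depth + 1);
        n.right = build(Arrays.copyOfRange(pts, mid + 1, pts.length), depth + 1);
        return n;
    }

    // Sum of squares over only the supplied (non-NaN) query dimensions.
    static double dist(double[] p, double[] q) {
        double s = 0;
        for (int d = 0; d < q.length; d++)
            if (!Double.isNaN(q[d])) { double diff = p[d] - q[d]; s += diff * diff; }
        return s;
    }

    static double[] best;
    static double bestDist;

    static double[] nearest(Node root, double[] query) {
        best = null;
        bestDist = Double.POSITIVE_INFINITY;
        search(root, query);
        return best;
    }

    static void search(Node n, double[] q) {
        if (n == null) return;
        double d = dist(n.point, q);
        if (d < bestDist) { bestDist = d; best = n.point; }
        if (Double.isNaN(q[n.axis])) {
            // Split dimension not supplied by the query: take BOTH branches.
            search(n.left, q);
            search(n.right, q);
        } else {
            double diff = q[n.axis] - n.point[n.axis];
            Node near = diff < 0 ? n.left : n.right;
            Node far  = diff < 0 ? n.right : n.left;
            search(near, q);
            if (diff * diff < bestDist) search(far, q);  // usual pruning test
        }
    }

    public static void main(String[] args) {
        double[][] pts = {{1, 1, 1}, {5, 5, 5}, {9, 1, 9}, {2, 8, 2}};
        Node root = build(pts, 0);
        // Query supplies dimensions 0 and 2 only; dimension 1 is a wildcard.
        double[] q = {9, Double.NaN, 9};
        System.out.println(Arrays.toString(nearest(root, q))); // [9.0, 1.0, 9.0]
    }
}
```

Extending this from nearest-1 to nearest-n means replacing the single `best` with a bounded max-heap of the n best candidates, which is the usual adaptation.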

comingstorm
A: 

Here's what I ended up doing: when a user didn't specify a value (when their query vector lacked a dimension), I simply adjusted my matching range (in the API) for that dimension to something huge, so that any value matches.

labratmatt