+2  A: 

So the following information can be cached:

  • Norm of the chosen vector
  • The dot product A.B, reusing it for both the numerator and the denominator in a given T(A,B) calculation.

If you only need the N closest vectors or if you are doing this same sorting process multiple times, there may be further tricks available. (Observations like T(A,B)=T(B,A), caching the vector norms for all the vectors, and perhaps some sort of thresholding/spatial sort).
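
For what it's worth, here is a minimal sketch of that caching, assuming the usual Tanimoto coefficient T(A,B) = A.B / (|A|^2 + |B|^2 - A.B); all names are illustrative, not from the question:

def tanimoto(a, b, norm_a_sq=None):
    """Tanimoto similarity: T(A, B) = A.B / (|A|^2 + |B|^2 - A.B)."""
    dot = sum(x * y for x, y in zip(a, b))    # computed once, reused in numerator and denominator
    if norm_a_sq is None:
        norm_a_sq = sum(x * x for x in a)
    norm_b_sq = sum(y * y for y in b)
    return dot / (norm_a_sq + norm_b_sq - dot)

# Cache the chosen vector's squared norm once, outside the loop over the database.
chosen = [1.0] * 60
chosen_norm_sq = sum(x * x for x in chosen)
database = [[0.5] * 60, [2.0] * 60]           # stand-in for the 100,000 stored vectors
scores = [tanimoto(chosen, v, chosen_norm_sq) for v in database]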

nsanders
What do you mean by "T(A,B) = T(B,A) cutting down the number of comparisons by half"? I just take one vector and compare it with the others. I don't need to compare the vectors with each other.
I apologize. I misread and thought you were performing this calculation for multiple vectors as you would in clustering algorithms. You can at least still cache the norm of your chosen vector and reuse the dot product in numerator/denominator.
nsanders
+1  A: 

In short, no, there is probably no way to avoid going through all the entries in the database. One qualifier on that: if you have a significant number of repeated vectors, you may be able to avoid reprocessing exact repeats.
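
For example, a rough sketch of scoring each distinct vector only once (assuming the vectors can be turned into hashable tuples; the names are mine):

def score_with_dedup(chosen, vectors, similarity):
    """Compute the similarity once per distinct vector; exact repeats reuse the cached score."""
    cache = {}
    scores = []
    for v in vectors:
        key = tuple(v)
        if key not in cache:
            cache[key] = similarity(chosen, v)
        scores.append(cache[key])
    return scores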

McWafflestix
Sorting the vectors ahead of time would make it easier to determine if the vector has been seen before.
emddudley
+1  A: 

Uh, no?

You only have to do all 99,999 against the one you picked (rather than all n(n-1)/2 possible pairs), of course, but that's as low as it goes.


Looking at your response to nsanders's answer, it is clear you are already on top of this part. But I've thought of a special case where computing the full set of comparisons might be a win. If:

  • the list comes in slowly (say you're getting them from some data acquisition system at a fixed, low rate)
  • you don't know until the end which one you want to compare to
  • you have plenty of storage
  • you need the answer fast when you do pick one (and the naive approach isn't fast enough)
  • Lookups are faster than computing

then you could pre-compute the comparisons as the data comes in and just look up the results per pair at sort time. This might also be effective if you will end up doing many sorts...
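
Here is a rough sketch of that pre-computation idea (all names are mine; similarity stands in for your T):

pair_cache = {}   # (i, j) with i < j  ->  precomputed comparison value
vectors = []      # filled incrementally as the data trickles in

def on_new_vector(v, similarity):
    """Called for each arriving vector: compare it against everything seen so far."""
    j = len(vectors)
    vectors.append(v)
    for i in range(j):
        pair_cache[(i, j)] = similarity(vectors[i], v)

def rank_against(target):
    """At sort time the comparisons are just cache lookups."""
    return sorted((i for i in range(len(vectors)) if i != target),
                  key=lambda i: pair_cache[(min(i, target), max(i, target))],
                  reverse=True)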

dmckee
Any real solution?
The "real" solution is that you *have* to compare them all. Even having a faster pre-comparer doesn't get you out of that requirement.
dmckee
+2  A: 

In order to sort something, you need a sorting key for each item. So you will need to process each entry at least once to calculate the key.

Is that what you were thinking of?

======= Moved comment here:

Given the description you cannot avoid looking at all entries to calculate your similarity factor. If you tell the database to use the similarity factor in the "order by" clause you can let it do all the hard work. Are you familiar with SQL?
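
If the sort ends up in application code rather than SQL, the same idea, computing the similarity factor once per entry as the sort key, looks roughly like this in Python (a sketch; similarity stands in for your T(A, B)):

def rank_by_similarity(chosen, vectors, similarity):
    # sorted() evaluates the key (the similarity factor) exactly once per entry,
    # whether the ordering happens here or in the database's ORDER BY.
    return sorted(vectors, key=lambda v: similarity(chosen, v), reverse=True)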

Thorbjørn Ravn Andersen
Right! But maybe I can threshold out 50% of the entries at the start of the computation?
How do you know what to cut out if you don't process everything?
Andy Mikula
+3  A: 

Update:

After you made it clear that 60 is the dimension of your space, not the length of the vectors, the answer below is not applicable to you, so I'll keep it just for history.


Since your vectors are normalized, you can employ a kd-tree to find the neighbors within an MBH (minimum bounding hyper-rectangle) of incremental hypervolume.

No database I'm aware of has native support for kd-trees, so you can try to implement the following solution in MySQL, if you are searching for a limited number of the closest entries:

  • Store the projections of the vectors onto each possible 2-dimensional subspace (takes n * (n - 1) / 2 columns)
  • Index each of these columns with a SPATIAL index
  • Pick a square MBR of a given area within any projection. The product of these MBR's will give you a hypercube of a limited hypervolume, which will hold all vectors with a distance not greater than a given one.
  • Find all projections within all MBR's using MBRContains

You'll still need to sort within this limited range of values.

For instance, say you have a set of 4-dimensional vectors with a magnitude of 2:

(2, 0, 0, 0)
(1, 1, 1, 1)
(0, 2, 0, 0)
(-2, 0, 0, 0)

You'll have to store them as follows:

p12  p13  p14  p23  p24  p34
---  ---  ---  ---  ---  ---
2,0  2,0  2,0  0,0  0,0  0,0
1,1  1,1  1,1  1,1  1,1  1,1
0,2  0,0  0,0  2,0  2,0  0,0
-2,0 -2,0 -2,0 0,0  0,0  0,0
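
(For what it's worth, the projection columns above can be generated mechanically; a small Python sketch, not part of the MySQL solution itself:)

from itertools import combinations

def projections(vector):
    """All 2-D projections of a vector: one (value_i, value_j) pair per column p_ij."""
    return {f"p{i + 1}{j + 1}": (vector[i], vector[j])
            for i, j in combinations(range(len(vector)), 2)}

print(projections((2, 0, 0, 0)))
# {'p12': (2, 0), 'p13': (2, 0), 'p14': (2, 0), 'p23': (0, 0), 'p24': (0, 0), 'p34': (0, 0)}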

Say, you want similarity with the first vector (2, 0, 0, 0) greater than 0.

This means having the vectors inside the hypercube: (0, -2, -2, -2):(4, 2, 2, 2).

You issue the following query:

SELECT  *
FROM    vectors
WHERE   MBRContains(LineFromText('LineString(0 -2, 4 2)'), p12)
        AND MBRContains(LineFromText('LineString(0 -2, 4 2)'), p13)
        …

and so on, for all six columns.

Quassnoi
Even knowing neighbors, won't he still have to find all the similarities? I suppose there's a possible gain in grouping for sort routines which do better when the array starts with a significant degree of sortedness.
Nosredna
@Nosredna: If his vectors are normalized (they are all of length 60), then having neighbors inside a hypercube with an edge less than 60 means having neighbors with a certain similarity greater than 0. The smaller the edge, the greater the similarity.
Quassnoi
The naive approach is O(n) (or O(n^2) for all possible selections of the preferred vector) in space and time. Is this actually faster or more compact?
dmckee
@dmckee: If he needs all values, then he should just count and sort. If he needs the TOP N values, or the values within a given threshold of similarity, then the indexes will be faster.
Quassnoi
@Quassnoi: Ah. I get it. Thanks.
dmckee
+1  A: 
Nosredna
There is an ambiguity in the term "length": he may (or may not) mean 60-dimensional vectors of unknown magnitude. Or you might have it right and he means vectors of magnitude 60, in which case the simplification of the per-comparison math you suggest applies, but he still has to do 99,999 comparisons at a minimum.
dmckee
Oh, I see what you mean. I was assuming he meant magnitude.
Nosredna
+1  A: 

If you're willing to live with approximations, there are a few ways you can avoid having to go through the whole database at runtime. In a background job you can start pre-computing pairwise distances between vectors. Doing this for the whole database is a huge computation, but it does not need to be finished for it to be useful (e.g. start by computing distances to 100 or so random vectors for each vector and store the results in a database).

Then triangulate: if the distance d between your target vector v and some vector v' is large, then the distance between v and all other vectors v'' that are close to v' will be large(-ish) too, so there is no need to compare them any more (you will have to find acceptable definitions of "large" yourself, though). You can experiment with repeating the process for the discarded vectors v'' too, and test how much runtime computation you can avoid before the accuracy starts to drop (make a test set of "correct" results for comparison).
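
A rough sketch of that pruning, formalized with the triangle inequality (all names and thresholds are placeholders you would tune yourself):

def prune_candidates(target, pivots, neighbors_of, distance, far_threshold):
    """Triangle-inequality pruning sketch.

    pivots:       vectors whose pairwise distances were precomputed in the background
    neighbors_of: dict pivot_index -> list of (vector_index, precomputed_distance_to_pivot)
    Returns the set of vector indices that can be skipped, because the lower bound
    d(target, x) >= d(target, pivot) - d(pivot, x) already exceeds far_threshold.
    """
    skip = set()
    for p_idx, pivot in enumerate(pivots):
        d_tp = distance(target, pivot)       # one real distance computation per pivot
        for x_idx, d_px in neighbors_of[p_idx]:
            if d_tp - d_px > far_threshold:
                skip.add(x_idx)
    return skip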

Good luck.

sds
Thank you! But unfortunately it doesn't work. The vectors in the DB are not sorted in any way; we cannot say that two adjacent vectors in the DB are similar.
Why would they need to be sorted? You still need to make a whole bunch of comparisons, it's just that with this approach you can determine entire "regions" in your database that do not need to be checked because they are guaranteed not to contain similar vectors. It's all about removing as much unnecessary computation as possible, but you will still need to do some.
sds
A: 
Andrea Ambu
+18  A: 

First, preprocess your vector list to normalize each vector to unit magnitude. Notice that your comparison function T() now has magnitude terms that become constant, and the formula can be simplified to finding the largest dot product between your test vector and the vectors in the database.

Now, think of a new function D, the distance between two points in 60D space. This is the classic L2 distance: take the difference of each component, square each, add all the squares, and take the square root of the sum. D(A, B) = sqrt((A - B) · (A - B)), where A and B are each 60-dimensional vectors.

This can be expanded, though, to D(A, B) = sqrt(A · A - 2 * dot(A, B) + B · B). Since A and B are unit magnitude, A · A = B · B = 1. And sqrt() is monotonic, so it won't change the sort order if we remove it and look at squared distances. That leaves 2 - 2 * dot(A, B), whose only varying term is -2 * dot(A, B). Thus, minimizing distance is exactly equivalent to maximizing the dot product.
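
A quick numeric check of that identity, using nothing beyond the standard library (the vectors are random stand-ins):

import math
import random

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

a = normalize([random.gauss(0, 1) for _ in range(60)])
b = normalize([random.gauss(0, 1) for _ in range(60)])

dot = sum(x * y for x, y in zip(a, b))
dist_sq = sum((x - y) ** 2 for x, y in zip(a, b))

# For unit vectors |A - B|^2 = 2 - 2 * dot(A, B), so the smallest distance
# corresponds to the largest dot product.
print(abs(dist_sq - (2 - 2 * dot)) < 1e-9)   # True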

So the original T() classification metric can be simplified into finding the highest dot product between the normalized vectors. And that comparison has been shown to be equivalent to finding the closest points to the sample point in 60-D space.

So now all you need to do is solve the equivalent problem of "given a normalized point in 60D space, list the 20 points in the database of normalized sample vectors which are nearest to it."

That problem is a well-understood one: it's K Nearest Neighbors. There are many algorithms for solving it. The most common is the classic KD tree.

But there's a problem. KD trees have O(e^D) behavior: high dimensionality quickly becomes painful, and 60 dimensions is definitely in that extremely painful category. Don't even try it.

There are several alternative general techniques for high-D nearest neighbor, however. This paper gives a clear method.

But in practice, there's a great solution involving yet another transform. If you have a metric space (which you do, or you wouldn't be using the Tanimoto comparison), you can reduce the dimensionality of the problem with a 60-dimensional rotation. That sounds complex and scary, but it's very common: it's a form of singular value decomposition, or eigenvalue decomposition. In statistics, it's known as Principal Components Analysis.

Basically this uses a simple linear computation to find which directions your database really spans. You can collapse the 60 dimensions down to a lower number, perhaps as low as 3 or 4, and still be able to accurately determine nearest neighbors. There are tons of software libraries for doing this in any language; see here for example.

Finally, you'll do a classic K-nearest-neighbors search in probably only 3-10 dimensions; you can experiment for the best behavior. There's a terrific library for doing this called Ranger, but you can use other libraries as well. A great side benefit is that you don't even need to store all 60 components of your sample data any more!

The nagging question is whether your data really can be collapsed to lower dimensions without affecting the accuracy of the results. In practice, the PCA decomposition can tell you the maximum residual error for whatever D limit you choose, so you can be assured it works. Since the comparison points are based on a distance metric, it's very likely they are intensely correlated, unlike say hash table values.

So the summary of the above:

  1. Normalize your vectors, transforming your problem into a K-nearest neighbor problem in 60 dimensions
  2. Use Principal Components Analysis to reduce dimensionality down to a manageable limit of say 5 dimensions
  3. Use a K Nearest Neighbor algorithm such as Ranger's KD tree library to find nearby samples.
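
As a rough end-to-end sketch of those three steps, using numpy and scikit-learn in place of Ranger (that library choice, and all the sizes and parameters, are my own assumptions, not from the answer):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

# 100,000 sample vectors of dimension 60 (random stand-ins for the real data)
data = np.random.rand(100_000, 60)
query = np.random.rand(60)

# 1. Normalize to unit magnitude: Tanimoto ranking becomes a nearest-neighbor search
data_unit = data / np.linalg.norm(data, axis=1, keepdims=True)
query_unit = query / np.linalg.norm(query)

# 2. Reduce dimensionality; explained_variance_ratio_ shows how much is lost
pca = PCA(n_components=5).fit(data_unit)
data_low = pca.transform(data_unit)
query_low = pca.transform(query_unit.reshape(1, -1))

# 3. Classic K-nearest-neighbor query with a KD tree in the reduced space
knn = NearestNeighbors(n_neighbors=20, algorithm='kd_tree').fit(data_low)
distances, indices = knn.kneighbors(query_low)
print(indices[0])   # indices of the ~20 nearest sample vectors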
SPWorley
I'm a bit puzzled by your statement: why do you think 60 dimensions is a lot? High-dimensional spaces are spaces with millions of dimensions, and in my experience anything short of a thousand or so dimensions can be dealt with simply using brute force (even on the order of 100,000 vectors). Dimensionality reductions like SVD or PCA are usually used to bring millions of dimensions down to a few tens of thousands, but how much information can be left in a 3-dimensional vector? Just my two cents.
sds
@sds: Brute force of course works, but the submitter specifically says he wants a method that is not brute force. 60D is a lot for the KD partitioning: you get inefficient sorting in high dimensions because each cut can only reduce the average radius by 2^(1/D). By reducing the dimensionality, you'll get a KD tree that could find the 20 closest points in roughly 30 steps (the algorithm is now nicely O(log(N)) for lookup cost, so adding more points is cheap). How many dimensions you can reduce to is data dependent, but 3-5 is typical, and the PCA fit will tell you the residual error.
SPWorley
@Arno Setagaya: That's a fair point I guess. It still sounds like you'll be throwing away a whole lot of useful information to shave off a second or so of computation, but like you said, that was the question. I'd be interested to hear what the OP ended up trying, and what gains it bought him.
sds
It is my impression that the method you are proposing would give approximate results, i.e. the top 20 vectors found after normalization and dimensionality reduction will sometimes differ from the top 20 found by using the Tanimoto classifier on the original vectors. The author hasn't yet stated whether that is OK for his problem, so maybe you should make this point clearer in your answer. Anyway, your suggestion seems well thought out and nicely explained.
Juozas Kontvainis
@Juozas, no, you can find the top 20 completely accurately. The dimensional reduction tells you the maximum residual error of the truncated dimensions, and you can use that as an extra required radius when searching the KD tree to guarantee you find all points that have any possibility of influencing your best-candidate ranking.
SPWorley
You just gave a summary worth 30 hours of lectures. Quite dense.
Andreas Petersson
@Andreas: thanks. Too bad Lerax didn't like it. :-)
SPWorley
A: 

Another take on this is the all-pairs problem with a given threshold for some similarity function. Have a look at Bayardo's paper and code here: http://code.google.com/p/google-all-pairs-similarity-search/

I do not know if your similarity function matches the approach, but if so, that is another tack to look at. It would also require normalized and sorted vectors in any case.

Philippe Ombredanne