ansaurus

Question

Efficient comparison of 1 million vectors containing (float, integer) tuples

Answer 1

+1 A:

Can't you sort the tuples and perform a binary search on the sorted array? I assume your database is built once and for all, and that the positions of the entries are not important. You can sort the array so that the tuples are in a given order. When the user enters a tuple, you look in the middle of the sorted array. If the query value is larger than the middle value, you repeat the search on the upper half, otherwise on the lower half.

Worst case is O(log n) comparisons per lookup, after a one-time O(n log n) sort.
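A minimal sketch of that idea in Python, using the standard `bisect` module. The function names `build_index` and `find_within`, the sample data, and the use of a tolerance window around the query float are assumptions for illustration, not part of the original answer.

```python
import bisect

def build_index(tuples):
    """Sort the (float, integer) tuples once; tuples compare by float first."""
    return sorted(tuples)

def find_within(index, value, tol):
    """Return all tuples whose float lies in [value - tol, value + tol]."""
    # (value - tol,) sorts before any 2-tuple with the same float
    lo = bisect.bisect_left(index, (value - tol,))
    # (value + tol, inf) sorts after any 2-tuple with the same float
    hi = bisect.bisect_right(index, (value + tol, float("inf")))
    return index[lo:hi]

index = build_index([(3.2, 5), (0.7, 1), (7.05, 9), (6.95, 9), (12.4, 2)])
print(find_within(index, 7.0, 0.1))  # → [(6.95, 9), (7.05, 9)]
```

Each lookup costs two binary searches plus the size of the returned slice; the integer part of each match still has to be checked separately.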

Stefano Borini 2010-02-22 12:47:43

Answer 2

A:

If you can "map" your reference data to x-y coordinates on a plane there is a nifty technique which allows you to select all points under a given distance/tolerance (using Hilbert curves).

Here is a detailed example.
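The linked example is not reproduced here. As a rough sketch of the general idea only: each (float, integer) tuple is quantised onto a square grid and replaced by its position along a Hilbert curve, so that points close together on the curve are also close together on the plane. The grid size and the value ranges in `to_cell` are assumptions, and both function names are hypothetical.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along a Hilbert curve on an n-by-n grid.

    n must be a power of two and 0 <= x, y < n.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # rotate/flip the quadrant so the sub-curve is oriented consistently
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d

def to_cell(value, integer, n=32, max_float=20.0, max_int=18):
    """Quantise a (float, integer) tuple onto the grid (ranges are assumed)."""
    x = min(n - 1, int(value / max_float * n))
    y = min(n - 1, int(integer / max_int * n))
    return x, y

reference = [(7.05, 9), (0.7, 1), (6.95, 9), (12.4, 2)]
by_curve = sorted(reference, key=lambda t: hilbert_index(32, *to_cell(*t)))
```

Neighbours on the curve are only candidates: two points can be close on the plane yet far apart on the curve, so a real search would still verify the distance for each candidate.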

p.marino 2010-02-22 12:55:39

Answer 3

+2 A:

An efficient linear scan of 1 million records of that type should take a fraction of a second on a modern machine; a compiled loop should be able to run at about memory bandwidth, which would transfer that much data in two or three milliseconds.

But, if you really need to optimise this, you could construct a hash table of the integer values, which would divide the job by the number of integer bins. And, if the data is stored sorted by the floats, that improves the locality of matching by those; you know you can stop once you're out of tolerance. Storing the offsets of each of a number of bins would give you a position to start.
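One possible reading of that suggestion, as a sketch: a dictionary keyed by the integer value plays the role of the hash table, and each bin keeps its floats sorted so that a tolerance window can be located with `bisect`. The names `build_bins` and `matches` are hypothetical.

```python
import bisect
from collections import defaultdict

def build_bins(tuples):
    """Group floats by their integer value and sort each group once."""
    bins = defaultdict(list)
    for value, integer in tuples:
        bins[integer].append(value)
    for values in bins.values():
        values.sort()
    return bins

def matches(bins, value, integer, tol):
    """Floats stored under `integer` that lie within `tol` of `value`."""
    values = bins.get(integer, [])
    lo = bisect.bisect_left(values, value - tol)
    hi = bisect.bisect_right(values, value + tol)
    return values[lo:hi]

bins = build_bins([(7.05, 9), (6.95, 9), (7.0, 3), (12.4, 9)])
print(matches(bins, 7.0, 9, 0.1))  # → [6.95, 7.05]
```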

I guess I don't see the need for a fancy algorithm yet... describe the problem a bit more, perhaps (you can assume a fairly high level of chemistry and physics knowledge if you like; I'm a physicist by training)?

Ok, given the extra info, I still see no need for anything better than a direct linear search, if there's only 1 million reference vectors and the algorithm is that simple. I just tried it, and even a pure Python implementation of linear scan took only around three seconds. It took several times longer to make up some random data to test with. This does somewhat depend on the rather lunatic level of optimisation in Python's sorting library, but that's the advantage of high level languages.

import random

# 1 million random (float, integer) reference tuples
r = [(random.uniform(0, 20), random.randint(1, 18)) for i in range(1000000)]
# this is a decorate-sort-undecorate pattern
# look for matches to (7,9)
# obviously, you can use whatever distance expression you want;
# note that this one adds the signed differences, so they can cancel out
zz = [(abs((7 - x) + (9 - y)), x, y) for x, y in r]
zz.sort()
# return the 50 best matches
print([(x, y) for a, x, y in zz[:50]])
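If only the best few matches are needed, a full sort is not required. The following sketch uses `heapq.nsmallest` from the standard library instead; the helper name `best_matches`, the smaller test set, and the choice of a sum of absolute differences as the distance are assumptions.

```python
import heapq
import random

def best_matches(reference, query, k=50):
    """Return the k reference tuples closest to `query`."""
    qx, qy = query
    # distance: sum of absolute differences (swap in whatever metric you need)
    return heapq.nsmallest(k, reference,
                           key=lambda t: abs(qx - t[0]) + abs(qy - t[1]))

random.seed(0)
reference = [(random.uniform(0, 20), random.randint(1, 18)) for _ in range(10000)]
top = best_matches(reference, (7, 9))
```

This still scans every record once, but it keeps only `k` candidates in memory rather than a decorated copy of the whole list.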

Andrew McGregor 2010-02-22 12:59:49

`zz=[(abs(7-a[0])+abs(9-a[1]),a) for a in r]` gives somewhat more sensible answers, of course.

Andrew McGregor 2010-02-22 13:50:26

Your time estimates in the first paragraph seem to be based on insufficient data, as you do not know how long the lists are.

Svante 2010-02-22 14:00:50

True, I suppose. It seems a bit underspecified. I mean, this could take anything up to a Voronoi diagram to solve properly, but without knowing some more about the problem it's hard to help.

Andrew McGregor 2010-02-22 21:17:32

Answer 4

+2 A:

You should add a physicist to the project :-) Comparing functions is a very common problem; e.g. look here:

In the first link you can read: "The SEQUEST algorithm for analyzing mass spectra makes use of autocorrelation in conjunction with cross-correlation to score the similarity of an observed spectrum to an idealized spectrum representing a peptide."
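To make the term concrete, here is a generic sketch of cross-correlation between two binned spectra. It is not the SEQUEST scoring itself; the function names and the toy intensity lists are made up for illustration.

```python
def cross_correlation(a, b, lag):
    """Sum of a[i] * b[i + lag] over the indices where both are defined."""
    return sum(a[i] * b[i + lag]
               for i in range(len(a))
               if 0 <= i + lag < len(b))

def best_lag(a, b, max_lag):
    """Lag in [-max_lag, max_lag] at which the two spectra line up best."""
    return max(range(-max_lag, max_lag + 1),
               key=lambda lag: cross_correlation(a, b, lag))

observed = [0, 0, 5, 0, 9, 0, 2, 0]
idealized = [0, 5, 0, 9, 0, 2, 0, 0]   # same peaks, shifted one bin to the left
print(best_lag(observed, idealized, 3))  # → -1
```

A high correlation at zero lag, relative to the values at other lags, suggests that the peaks of the two spectra coincide.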

Karussell 2010-02-22 13:07:51

Answer 5

A:

One approach we are trying ourselves, which allows for discrepancies between query and reference, is binning the float values. We are testing and want to offer the user a choice of bin sizes: 0.1, 0.2, 0.3 or 0.4. Binning therefore leaves us with between 50 and 200 bins, each with a corresponding integer value between 0 and 18, where 0 means there was no value within that bin. The reference data can be pre-binned and stored in the database. We can then take the binned query data and compare it with the reference data. One approach could be, for all bins, to subtract the query integer value from the reference integer value. By summing up all differences we get the similarity score, with the most similar reference entries resulting in the lowest scores.
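A small sketch of this binning scheme. Several details are assumptions rather than part of the description above: the floats are taken to lie in the range 0 to 20 (which yields 200 bins at size 0.1 and 50 at size 0.4), a bin holding several values keeps the largest integer, and the per-bin differences are summed as absolute values so that they cannot cancel.

```python
import math

def bin_spectrum(tuples, bin_size, max_float=20.0):
    """Turn (float, integer) tuples into a fixed-length list of bin values."""
    n_bins = math.ceil(round(max_float / bin_size, 9))
    bins = [0] * n_bins            # 0 means "no value in this bin"
    for value, integer in tuples:
        i = min(n_bins - 1, int(value / bin_size))
        bins[i] = max(bins[i], integer)
    return bins

def score(query_bins, reference_bins):
    """Sum of absolute per-bin differences; lower means more similar."""
    return sum(abs(q - r) for q, r in zip(query_bins, reference_bins))

query = bin_spectrum([(1.23, 5), (7.8, 9)], 0.4)
reference = bin_spectrum([(1.3, 4), (12.1, 2)], 0.4)
print(score(query, reference))  # → 12
```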

Another (simpler) search option we want to offer is where the user enters only the float values. The integer values in both the query and the reference list can then be set to 1. We then use Hamming distance to compute the difference between the query and the reference binned values. I have previously asked about an efficient algorithm for that search.
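With all integers set to 1, each binned list reduces to a presence/absence pattern. One way to sketch this is to pack the pattern into the bits of a Python integer, so that the Hamming distance becomes a bit count of the XOR. The helper names and the assumed float range of 0 to 20 are illustrative only.

```python
def presence_mask(values, bin_size, max_float=20.0):
    """Encode which bins contain at least one float as the bits of an int."""
    mask = 0
    for value in values:
        if 0 <= value < max_float:
            mask |= 1 << int(value / bin_size)
    return mask

def hamming(a, b):
    """Number of bins occupied in exactly one of the two masks."""
    return bin(a ^ b).count("1")

query = presence_mask([1.23, 7.8, 15.0], 0.4)
reference = presence_mask([1.3, 7.9, 3.0], 0.4)
print(hamming(query, reference))  # → 2
```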

This binning is only one way of achieving our goal. I am open to other suggestions. Perhaps we can use Principal Component Analysis (PCA), as described here

Simmer 2010-02-22 13:44:44

Does this mean that the tuples are or can be sorted by the float value, within a single data set?

Svante 2010-02-22 14:03:37

Binning leaves you with a bias when the query value lies very near the edge of a bin.
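A tiny illustration of the edge effect: two values that differ by 0.002 can land in different bins, while two values that differ by almost a full bin width can share one. Treating adjacent bins as candidates, as in the hypothetical `same_or_adjacent_bin` below, is one possible mitigation and is an assumption, not something proposed in the thread.

```python
def bin_index(value, bin_size):
    """Index of the bin that `value` falls into."""
    return int(value / bin_size)

def same_or_adjacent_bin(a, b, bin_size):
    """Looser match: accept values in the same bin or in neighbouring bins."""
    return abs(bin_index(a, bin_size) - bin_index(b, bin_size)) <= 1

# 0.299 and 0.301 are 0.002 apart but fall on either side of a bin edge
print(bin_index(0.299, 0.1), bin_index(0.301, 0.1))  # → 2 3
# 0.201 and 0.299 are 0.098 apart but share a bin
print(bin_index(0.201, 0.1), bin_index(0.299, 0.1))  # → 2 2
```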

dmckee 2010-02-22 14:23:25

@Svante: yes they can be sorted by the float value

Simmer 2010-02-23 14:26:23

Do the tuples, sorted by float value, describe a continuous curve?

Svante 2010-02-23 19:50:44

@Svante: no, they do not describe a continuous curve. They are discrete values.

Simmer 2010-02-24 11:41:01
