views:

235

answers:

1

Imagine I have a table which stores a series of sparse vectors. A sparse vector means that it stores only the nonzero values explicitly in the data structure. I could have a 1 million dimensional vector, but I only store the values for the dimensions which are nonzero. So the size is proportional to the number of nonzero entries, not the dimensionality of the vector.

Table definition would be something like this: vector_id : int dimension : int value : float

Now, in normal programming land I can compute the inner product or dot product of two vectors in O(|v1| + |v2|) time. Basically the algorithm is to store the sparse vectors sorted by dimension and iterate through the dimensions in each until you find collisions between dimensions and multiply the values of the shared dimension and keep adding those up until you get to the end of either one of the vectors.

What's the fastest way to pull this off in SQL?

+2  A: 

You should be able to replicate this algorithm in one query:

select sum(v1.value * v2.value)
from vectors v1
inner join vectors v2
on v1.dimension = v2.dimension
where v1.vector_id = ...
and v2.vector_id = ...
dpmattingly
So how would you index the table? By (vector_id, dimension)?
Chris Harris
Indexing by (vector_id, dimension) makes the most sense, since those should define a unique record in the table.
dpmattingly
This is basically what I came up with -- until anyone else posts something faster I'll give it to you. Thanks!
Chris Harris