views:

94

answers:

1

The basic table schema looks something like this (I'm using MySQL BTW):

integer unsigned vector-id
integer unsigned fk-attribute-id
float attribute-value
primary key (vector-id,fk-attribute-id)

The vector is represented as multiple records in the table with the same vector-id

I need to build a separate table with the dot product (also euclidean distance) of all vectors that exist in this table. So, I need a result table that looks like this:

integer unsigned fk-vector-id-a
integer unsigned fk-vector-id-b
float dot-product


...and one like this...

integer unsigned fk-vector-id-a
integer unsigned fk-vector-id-b
float euclidean-distance

What is the best query structure to produce my result?

With very large vectors, is a relational database the best approach to solve this problem, or should I internalize the vectors in an application and do the calculation there?

+2  A: 
INSERT
INTO    dot_products
SELECT  v1.vector_id, v2.vector_id, SUM(v1.attribute_value * v2.attribute_value)
FROM    attributes v1
JOIN    attributes v2
ON      v2.attribute_id = v1.attribute_id
GROUP BY
        v1.vector_id, v2.vector_id

In MySQL, this can be faster:

INSERT
INTO    dot_products
SELECT  v1.vector_id, v2.vector_id,
        (
        SELECT  SUM(va1.attribute_value * va2.attribute_value)
        FROM    attributes va1
        JOIN    attributes va2
        ON      va2.attribute_id = va1.attribute_id
        WHERE   va1.vector_id = v1.vector_id
                AND va2.vector_id = v2.vector_id
        )
FROM    vector v1
CROSS JOIN
        vector v2
Quassnoi
This will work but what are the performance characteristics of the query? Will the JOIN take a year to complete on a large vector table?
JR Lawhorne
`@JR Lawhorne`: if you have an index on `(atttribute_id, vector_id)`, it will most probably be faster than pulling the resultset into an application, building the new values and inserting them back.
Quassnoi
I suggest adding `WHERE v1.vector_id < v2.vector_id` to avoid computing all dot products twice.
j_random_hacker