tags:

views:

35

answers:

3

I have an item-to-item similarity matrix set up with these tables:

items (id, ...) (Primary key `id`)
similarities (item1_id, item2_id, similarity) (Index on `item1_id` and `item2_id`)

The similarities tables contains pairs of ids with a similarity index, i.e:

item1_id  item2_id  similarity
1         2         0.3143
2         3         0.734

For efficient storage "reverse pairs" are omitted, i.e. there's only one pair (1,2), there's no redundant pair (2,1). That means the foreign key for an item may be either item1_id or item2_id.

Now I want to find items that are similar to a bunch of other items, sorted by descending similarity. I'm using this query:

SELECT    `Item`.*
FROM      `items` AS `Item`
LEFT JOIN `similarities` AS `Similarity`
       ON (`Item`.`id` = `Similarity`.`item1_id`
              AND `Similarity`.`item2_id` IN (1, 2, 3, ...))
          OR (`Item`.`id` = `Similarity`.`item2_id`
              AND `Similarity`.`item1_id` IN (1, 2, ,3, ...))
WHERE     `Similarity`.`item1_id` IN (1, 2, 3, ...)
          OR `Similarity`.`item2_id` IN (1, 2, 3, ...)
GROUP BY  `Item`.`id`
ORDER BY  `Similarity`.`similarity` desc

It's extremely slow though, it takes 4-5 seconds for ~100,000 items and ~30,000 similarity pairs. It seems the JOIN is extremely costly. Here's the query EXPLAINed:

select_type  table       type         possible_keys      key                key_len  ref   rows    Extra
SIMPLE       Similarity  index_merge  item1_id,item2_id  item1_id,item2_id  110,110  NULL  31      Using sort_union(item1_id,...
SIMPLE       Item        ALL          PRIMARY            NULL               NULL     NULL  136600  Using where; Using join buffer

What can I do to speed this up? Worst case I would do it in two separate queries, but I'd prefer one JOIN query if possible.

+1  A: 

I didn't actually try this but maybe it points you in the right direction. The idea is to make a temp result of the UNION of (unique) id, similarity pairs from similarities, then join items with that.

SELECT Item.*, s.other_item_id, s.similarity
FROM items AS Item
JOIN
    (
    SELECT item1_id AS id, item2_id AS other_item_id, similarity FROM similarities
    UNION
    SELECT item2_id AS id, item1_id AS other_item_id, similarity FROM similarities
    ) AS s ON s.id = items.id
WHERE items.id IN (1, 2, 3, ...)
ORDER BY s.similarity DESC;

In your original query you don't need to restrict the ids from similarities in both the JOIN condition and the WHERE clause.

gregjor
So basically my two-seperate-queries strategy rolled into one query? :) I'll give this a try. I have to see if how to push this through my DAL though.
deceze
It would help if you can give an example of the desired result. Do you want to see the other item ID a given item is similar to?
gregjor
I only want the records of the items, like a normal `SELECT * FROM items`. The similarity is only the condition, I don't need to fetch it. In fact, I'll probably be using a `LIMIT` to only fetch a number of "top similar" items. You could describe the query as "Find 10 items that are most similar to these items (1, 2, 3, ...)."
deceze
OK... so you will have a WHERE condition that limits the result to similarity in a specific range or something?
gregjor
Comment timing overlap... :) It'll be a `LIMIT`ed query, hence the sorting. Also, similarities of 0 are omitted in the matrix, so as long as there *is* a similarity and it's sorted in descending order, that's fine.
deceze
I see. I'm not sure my solution gets exactly what you want but maybe it points you to another way to do the join without so many conditions. You could make a VIEW from the UNION'd SELECTs, or a temp table if similarities doesn't change often and you need to run a lot of queries against it.
gregjor
Hmm, overall the query is about as slow as my original one, since it still requires a ginormous JOIN...
deceze
This brought me on the right track though. Thanks!
deceze
+1  A: 

I am wondering whether joining to the items table twice will perform better than the two queries. Pardon the psuedo-code-ish SELECT portion of this statement - I think you'll actually need a CASE for every field value...

SELECT    
CASE WHEN `Item2`.`id` IS NULL THEN 
  `Item1`.`id`
ELSE `Item2`.`id`
END,

SELECT    
CASE WHEN `Item2`.`id` IS NULL THEN 
  `Item1`.`name`
ELSE `Item2`.`name`
END,

SELECT    
CASE WHEN `Item2`.`id` IS NULL THEN 
  `Item1`.`description`
ELSE `Item2`.`description`
END,

[and so on]

FROM      `items` AS `Item1`
LEFT OUTER JOIN `similarities` AS `Similarity`
       ON (`Item1`.`id` = `Similarity`.`item1_id`
RIGHT OUTER JOIN `items` AS `Item2`
       ON (`Item2`.`id` = `Similarity`.`item2_id`       
WHERE     `Similarity`.`item1_id` IN (1, 2, 3, ...)
          OR `Similarity`.`item2_id` IN (1, 2, 3, ...)
ORDER BY  `Similarity`.`similarity` desc
LesterDove
Uuh, funky, thanks! :) I'm getting a syntax error on the `CASE` statement within the `SELECT` statement though. Just testing with `SELECT Item1.*, Item2.*` gives blazingly fast results though. I'll have to play around with this more. Would be happy if you could improve this as well.
deceze
Testing with `CASE WHEN Item2.id IS NULL THEN Item1.id ELSE Item2.id END CASE` gives me a syntax error near `CASE FROM ...` (the `END CASE` statement) BTW.
deceze
Thanks, I just edited it a bit. Separate CASE for each field name (this is only so that your new query contains the same # of fields as your old one, since you mentioned an existing DAL already.)
LesterDove
I ended up with a different solution, but your answer gave me some inspiration. Thanks!
deceze
A: 

Thanks to the inspirations, I ended up with this query:

SELECT `Item`.*
FROM `items` AS `Item`
JOIN (
    SELECT `item1_id` AS `id`, `similarity`
    FROM   `similarities`
    WHERE  `similarities`.`item2_id` IN (1, 2, 3, ...)
    UNION
    SELECT `item2_id` AS `id`, `similarity`
    FROM   `similarities`
    WHERE  `similarities`.`item1_id` IN (1, 2, 3, ...)
) AS `SimilarityUnion` ON `SimilarityUnion`.`id` = `Item`.`id`
GROUP BY `SimilarityUnion`.`id`
ORDER BY `SimilarityUnion`.`similarity` DESC
deceze