I have an item-to-item similarity matrix set up with these tables:
items (id, ...) (Primary key `id`)
similarities (item1_id, item2_id, similarity) (Index on `item1_id` and `item2_id`)
The similarities
tables contains pairs of ids with a similarity index, i.e:
item1_id item2_id similarity
1 2 0.3143
2 3 0.734
For efficient storage "reverse pairs" are omitted, i.e. there's only one pair (1,2), there's no redundant pair (2,1). That means the foreign key for an item may be either item1_id
or item2_id
.
Now I want to find items that are similar to a bunch of other items, sorted by descending similarity. I'm using this query:
SELECT `Item`.*
FROM `items` AS `Item`
LEFT JOIN `similarities` AS `Similarity`
ON (`Item`.`id` = `Similarity`.`item1_id`
AND `Similarity`.`item2_id` IN (1, 2, 3, ...))
OR (`Item`.`id` = `Similarity`.`item2_id`
AND `Similarity`.`item1_id` IN (1, 2, ,3, ...))
WHERE `Similarity`.`item1_id` IN (1, 2, 3, ...)
OR `Similarity`.`item2_id` IN (1, 2, 3, ...)
GROUP BY `Item`.`id`
ORDER BY `Similarity`.`similarity` desc
It's extremely slow though, it takes 4-5 seconds for ~100,000 items and ~30,000 similarity pairs. It seems the JOIN is extremely costly. Here's the query EXPLAIN
ed:
select_type table type possible_keys key key_len ref rows Extra
SIMPLE Similarity index_merge item1_id,item2_id item1_id,item2_id 110,110 NULL 31 Using sort_union(item1_id,...
SIMPLE Item ALL PRIMARY NULL NULL NULL 136600 Using where; Using join buffer
What can I do to speed this up? Worst case I would do it in two separate queries, but I'd prefer one JOIN query if possible.