Say I have two MyISAM tables:
tab_big: id1, id2, id_a, ord (5 billion records)
tab_small: id1, id2, id_b (1 billion records)
CREATE TABLE IF NOT EXISTS `tab_big` (
`id_a` int(10) unsigned NOT NULL,
`id1` int(10) unsigned NOT NULL,
`id2` int(10) unsigned NOT NULL,
`ord` int(10) unsigned NOT NULL DEFAULT '1',
PRIMARY KEY (`id_a`,`id1`,`id2`),
KEY `id1` (`id1`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
CREATE TABLE IF NOT EXISTS `tab_small` (
`id_b` int(10) unsigned NOT NULL,
`id1` int(10) unsigned NOT NULL,
`id2` int(10) unsigned NOT NULL,
PRIMARY KEY (`id_b`,`id1`,`id2`),
KEY `id_b` (`id_b`),
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
All fields are INT. In both tables the combination of the three id fields (respectively id1, id2, id_a and id1, id2, id_b) values is unique, so for both I created a primary key on these three fields.
I need an efficient query that gets unique values of id_a from the first table, where:
- id_b in the second table table is a given value (narrowing it down to ca. 10k entries)
- id1/id2 combination is identical in both tables
- id_a in the first table is not same as either of id1, id2 fields in the tab_small subset (as narrowed down by id_b field); after a bit of fiddling it seems that generating the list (around 200 ids) in php and providing it as text performs better than adding another JOIN).
I believe it's not very cacheable, since both tables change all the time (rows are added).
My current query is pretty straightforward:
SELECT tab_big.id_a FROM tab_big, tab_small
WHERE tab_small.id_b = '$constant'
AND tab_big.id1 = tab_small.id1 AND tab_big.id2 = tab_small.id2
AND tab_big.id_a NOT IN ({comma delimited list of 200 ids})
GROUP BY tab_big.id_a
ORDER BY SUM(tab_big.ord) DESC
LIMIT 10
It works, but not fast enough to really use it. What can be done with it?
EXPLAIN says it first gets a ranged query from tab_big, then applies that to tab_small (Edit: added below). I don't know why (EXPLAIN says the query uses primary keys), but adding a tab_big.id1 index seems to help a bit. Also, trying to make it go the other way around with STRAIGHT_JOIN, first selecting a 10k subset from (smaller) tab_small and then using it to search in (larger) tab_big gives much worse results than default (Edit: with a small dataset that I have now to test on; on production data it would apparently be the other way around and EXPLAIN would look like the second one).
+----+-------------+-----------+--------+-----------------+---------+---------+-------------------------------------------+---------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+--------+-----------------+---------+---------+-------------------------------------------+---------+----------------------------------------------+
| 1 | SIMPLE | tab_big | range | PRIMARY,id1 | PRIMARY | 4 | NULL | 1374793 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | tab_small | eq_ref | PRIMARY,id_b | PRIMARY | 12 | const,db.tab_big.id1,db.tab_big.id2 | 1 | Using index |
+----+-------------+-----------+--------+-----------------+---------+---------+-------------------------------------------+---------+----------------------------------------------+
On larger datasets EXPLAIN would probably look more like this (disregard the 'rows' values though - it's taken from a smaller dataset):
+----+-------------+-----------+------+---------------------+---------+---------+------------------+-------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+------+---------------------+---------+---------+------------------+-------+----------------------------------------------+
| 1 | SIMPLE | tab_small | ref | PRIMARY,id_b,id1 | PRIMARY | 4 | const | 259 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | tab_big | ref | PRIMARY,id1 | id1 | 4 | db.tab_small.id1 | 25692 | Using where |
+----+-------------+-----------+------+---------------------+---------+---------+------------------+-------+----------------------------------------------+
Any thoughts?