views:

133

answers:

3

Say I have two MyISAM tables:

tab_big:   id1, id2, id_a, ord         (5 billion records)
tab_small: id1, id2, id_b              (1 billion records)


CREATE TABLE IF NOT EXISTS `tab_big` (
  `id_a` int(10) unsigned NOT NULL,
  `id1` int(10) unsigned NOT NULL,
  `id2` int(10) unsigned NOT NULL,
  `ord` int(10) unsigned NOT NULL DEFAULT '1',
  PRIMARY KEY (`id_a`,`id1`,`id2`),
  KEY `id1` (`id1`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;


CREATE TABLE IF NOT EXISTS `tab_small` (
  `id_b` int(10) unsigned NOT NULL,
  `id1` int(10) unsigned NOT NULL,
  `id2` int(10) unsigned NOT NULL,
  PRIMARY KEY (`id_b`,`id1`,`id2`),
  KEY `id_b` (`id_b`),
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

All fields are INT. In both tables the combination of the three id fields (respectively id1, id2, id_a and id1, id2, id_b) values is unique, so for both I created a primary key on these three fields.

I need an efficient query that gets unique values of id_a from the first table, where:

  1. id_b in the second table table is a given value (narrowing it down to ca. 10k entries)
  2. id1/id2 combination is identical in both tables
  3. id_a in the first table is not same as either of id1, id2 fields in the tab_small subset (as narrowed down by id_b field); after a bit of fiddling it seems that generating the list (around 200 ids) in php and providing it as text performs better than adding another JOIN).

I believe it's not very cacheable, since both tables change all the time (rows are added).

My current query is pretty straightforward:

SELECT tab_big.id_a FROM tab_big, tab_small
    WHERE tab_small.id_b = '$constant'
    AND tab_big.id1 = tab_small.id1 AND tab_big.id2 = tab_small.id2
    AND tab_big.id_a NOT IN ({comma delimited list of 200 ids})
    GROUP BY tab_big.id_a
    ORDER BY SUM(tab_big.ord) DESC
    LIMIT 10

It works, but not fast enough to really use it. What can be done with it?

EXPLAIN says it first gets a ranged query from tab_big, then applies that to tab_small (Edit: added below). I don't know why (EXPLAIN says the query uses primary keys), but adding a tab_big.id1 index seems to help a bit. Also, trying to make it go the other way around with STRAIGHT_JOIN, first selecting a 10k subset from (smaller) tab_small and then using it to search in (larger) tab_big gives much worse results than default (Edit: with a small dataset that I have now to test on; on production data it would apparently be the other way around and EXPLAIN would look like the second one).

+----+-------------+-----------+--------+-----------------+---------+---------+-------------------------------------------+---------+----------------------------------------------+
| id | select_type | table     | type   | possible_keys   | key     | key_len | ref                                       | rows    | Extra                                        |
+----+-------------+-----------+--------+-----------------+---------+---------+-------------------------------------------+---------+----------------------------------------------+
|  1 | SIMPLE      | tab_big   | range  | PRIMARY,id1     | PRIMARY | 4       | NULL                                      | 1374793 | Using where; Using temporary; Using filesort | 
|  1 | SIMPLE      | tab_small | eq_ref | PRIMARY,id_b    | PRIMARY | 12      | const,db.tab_big.id1,db.tab_big.id2       |       1 | Using index                                  | 
+----+-------------+-----------+--------+-----------------+---------+---------+-------------------------------------------+---------+----------------------------------------------+

On larger datasets EXPLAIN would probably look more like this (disregard the 'rows' values though - it's taken from a smaller dataset):

+----+-------------+-----------+------+---------------------+---------+---------+------------------+-------+----------------------------------------------+
| id | select_type | table     | type | possible_keys       | key     | key_len | ref              | rows  | Extra                                        |
+----+-------------+-----------+------+---------------------+---------+---------+------------------+-------+----------------------------------------------+
|  1 | SIMPLE      | tab_small | ref  | PRIMARY,id_b,id1    | PRIMARY | 4       | const            |   259 | Using index; Using temporary; Using filesort | 
|  1 | SIMPLE      | tab_big   | ref  | PRIMARY,id1         | id1     | 4       | db.tab_small.id1 | 25692 | Using where                                  | 
+----+-------------+-----------+------+---------------------+---------+---------+------------------+-------+----------------------------------------------+

Any thoughts?

A: 

Have you tried tab_small LEFT JOIN tab_big? Also you can create indexes on the fields tab_small.id_b and tab_big.id_a

rubayeet
Tried the LEFT JOIN just in case, worked actually worse. I actually have a tab_small id_b index; adding a tab_big.id_a index didn't help though.
Mike
A: 

I would suggest to put an index on all four columns that are part of the join (either four separate indexes on the tb.id1, tb.id2, ts.id1 and ts.id2 column, or two on tb.id1/id2 and ts.id1/id2). Then see if that gives you any better performance. (I think it does, but you never know unless trying it out.)


NOTE: The following idea does not work, but I left it in so the comments still make some sense.

Also, instead of using the PHP generated list, can't you express your restriction (3) in the join condition (or if you prefer, in the where clause) as well? (Similar to what rexem suggested)

SELECT tb.id_a
  FROM TAB_BIG tb
  JOIN TAB_SMALL ts ON ts.id1 = tb.id1
                 AND ts.id2 = tb.id2
                 AND tb.id1 <> ts.id_a
                 AND tb.id2 <> ts.id_a
 WHERE ts.id_b = ?

But this is more for clarity and simplicity than performance. (Also note that the additional conditions may require another index on id_a and probably separate indexes on tb.id1 and tb.id2.)

IronGoofy
Tried adding the id1, id2 indexes, didn't help (explain still says it uses PRIMARY).Wouldn't the <> clauses here exclude only those entries where one of id1, id2 is the same as id\_a in this particular entry? I need to exclude _all_ the id\_a's that ever appear (as id1 or id2) in a ts record with a particular id\_b.
Mike
Okay, then the EXISTS as by rexem would be right (or the statement by Quassnoi). I'll leave the suggestion in the post for clarity.
IronGoofy
+2  A: 

Create the following indexes:

CREATE INDEX ix_big_1_2_a ON tab_big (id1, id2, id_a)
CREATE UNIQUE INDEX ux_small_b_2_1 ON tab_small (id_b, id2, id1)

and try this:

SELECT  DISTINCT
        a.id_a
FROM    tab_small b
JOIN    tab_big a
ON      (a.id1, a.id2) = (b.id1, b.id2)
WHERE   b.id_b = 2
        AND a.id_a NOT IN
        (
        SELECT  id1
        FROM    tab_small b1 /* FORCE INDEX (PRIMARY) */
        WHERE   b1.id_b = 2
        )
        AND a.id_a NOT IN
        (
        SELECT  id2
        FROM    tab_small b2 /* FORCE INDEX (ux_small_b_2_1) */
        WHERE   b2.id_b = 2
        )

, which produces this query plan:

1, 'PRIMARY', 'b', 'ref', 'PRIMARY,ux_small_b_2_1', 'PRIMARY', '4', 'const', 1, 100.00, 'Using index; Using temporary'
1, 'PRIMARY', 'a', 'ref', 'ix_big_1_2', 'ix_big_1_2', '8', 'test.b.id1,test.b.id2', 2, 100.00, 'Using where'
3, 'DEPENDENT SUBQUERY', 'b2', 'ref', 'ux_small_b_2_1', 'ux_small_b_2_1', '8', 'const,func', 1, 100.00, 'Using index'
2, 'DEPENDENT SUBQUERY', 'b1', 'ref', 'PRIMARY', 'PRIMARY', '8', 'const,func', 1, 100.00, 'Using index'

It is not as efficient as it could be, still I'm expecting this to be faster than your query.

I commented out the FORCE INDEX statements, but you may need to uncomment them is the optimizer will not pick these indexes.

Everything would be much simpler if MySQL were capable of doing FULL OUTER JOIN using MERGE, but it does not.

Update:

Judging by your statistics, this query will be more efficient:

SELECT  id_a
FROM    (
        SELECT  DISTINCT id_a
        FROM    tab_big ad
        ) a
WHERE   id_a NOT IN
        (
        SELECT  id1
        FROM    tab_small b1 FORCE INDEX (PRIMARY)
        WHERE   b1.id_b = 2
        )
        AND id_a NOT IN
        (
        SELECT  id2
        FROM    tab_small b2 FORCE INDEX (ux_small_b_2_1)
        WHERE   b2.id_b = 2
        )
        AND EXISTS
        (
        SELECT  NULL
        FROM    tab_small be
        JOIN    tab_big ae
        ON      (ae.id1, ae.id2) = (be.id1, be.id2)
        WHERE   be.id_b = 2
                AND ae.id_a = a.id_a
        )

It works as follows:

  • Builds the list of DISTINCT id_a (which is 100,000 rows long)
  • Filters out the values present in the subset
  • For each value of id_a, it searches the subset for the presence of (id_a, id1, id2). This is done by iterating the subset. Since the probability to find this value is high, most probably the search will succeed in 10 rows or so from the beginning of the subset, and EXISTS will return that very moment.

This will most probably need to evaluate just about 1,000,000 records or so.

Make sure that the following plan is used:

1, 'PRIMARY', '<derived2>', 'ALL', '', '', '', '', 8192, 100.00, 'Using where'
5, 'DEPENDENT SUBQUERY', 'be', 'ref', 'PRIMARY,ux_small_b_2_1', 'PRIMARY', '4', 'const', 1, 100.00, 'Using index'
5, 'DEPENDENT SUBQUERY', 'ae', 'eq_ref', 'PRIMARY,ix_big_1_2', 'PRIMARY', '12', 'a.id_a,test.be.id1,test.be.id2', 1, 100.00, 'Using index'
4, 'DEPENDENT SUBQUERY', 'b2', 'ref', 'ux_small_b_2_1', 'ux_small_b_2_1', '8', 'const,func', 1, 100.00, 'Using index'
3, 'DEPENDENT SUBQUERY', 'b1', 'ref', 'PRIMARY', 'PRIMARY', '8', 'const,func', 1, 100.00, 'Using index'
2, 'DERIVED', 'ad', 'range', '', 'PRIMARY', '4', '', 10, 100.00, 'Using index for group-by'

, the most important part being Using index for group-by in the last row.

Quassnoi
I don't understand why you'd define the indexes as you suggested. In order for the join to work with the indexes, wouldn't all the columns used in the join have to be indexed and in the same order as in the join conditions?My feeling is that the statement is slow because of the join ... not because of the subqueries!
IronGoofy
The columns used in the `JOIN` are indexed in `ix_big_1_2_a`. The statement may (or may not) be slow because of the `JOIN`, but we cannot tell it's the real reason until we know how many rows in `tab_big` satisfy the `JOIN` condition.
Quassnoi
Nice!First of all, the ix_big_1_2_a makes a huge difference with the original query.Second, the query you suggested works even better. Unfortunately it loses the ORDER BY part from the original query (which is supposed to present the most suitable entries first), but I might be able to cheat around that.Thanks a bunch! I really appreciate it. :)
Mike