ansaurus

Question

Answer 1

A:

Have you tried tab_small LEFT JOIN tab_big? Also you can create indexes on the fields tab_small.id_b and tab_big.id_a

rubayeet 2009-10-09 03:55:53

Tried the LEFT JOIN just in case; it actually performed worse. I already have an index on tab_small.id_b; adding an index on tab_big.id_a didn't help though.

Mike 2009-10-09 04:22:59

Answer 2

A:

I would suggest putting an index on all four columns that are part of the join (either four separate indexes on the tb.id1, tb.id2, ts.id1 and ts.id2 columns, or two composite ones on tb.id1/id2 and ts.id1/id2). Then see if that gives you any better performance. (I think it will, but you never know until you try it.)

NOTE: The following idea does not work, but I left it in so the comments still make some sense.

Also, instead of using the PHP generated list, can't you express your restriction (3) in the join condition (or if you prefer, in the where clause) as well? (Similar to what rexem suggested)

SELECT tb.id_a
  FROM TAB_BIG tb
  JOIN TAB_SMALL ts ON ts.id1 = tb.id1
                 AND ts.id2 = tb.id2
                 AND tb.id1 <> ts.id_a
                 AND tb.id2 <> ts.id_a
 WHERE ts.id_b = ?

But this is more for clarity and simplicity than performance. (Also note that the additional conditions may require another index on id_a and probably separate indexes on tb.id1 and tb.id2.)

IronGoofy 2009-10-09 07:01:31

Tried adding the id1, id2 indexes; it didn't help (EXPLAIN still says it uses PRIMARY). Wouldn't the <> clauses here exclude only those entries where one of id1, id2 is the same as id_a in this particular entry? I need to exclude all the id_a values that ever appear (as id1 or id2) in a ts record with a particular id_b.

Mike 2009-10-09 12:07:00
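
To make the objection in the comment above concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL, purely so the example is self-contained; it says nothing about MySQL performance. The table layout tab_big(id_a, id1, id2) and tab_small(id_b, id1, id2) is inferred from the queries in this thread, the toy rows are invented, and the row-local condition is written as tb.id_a <> ts.id1 AND tb.id_a <> ts.id2, which is only one possible reading of the join in the answer.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption: semantics only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tab_small (id_b INTEGER, id1 INTEGER, id2 INTEGER);
CREATE TABLE tab_big   (id_a INTEGER, id1 INTEGER, id2 INTEGER);
-- Invented toy data: two tab_small rows share id_b = 2.
INSERT INTO tab_small VALUES (2, 10, 20), (2, 30, 40);
INSERT INTO tab_big   VALUES (30, 10, 20), (50, 10, 20), (10, 30, 40), (20, 10, 20);
""")

# Row-local check: id_a is compared only with the tab_small row it joined to.
ROW_LOCAL = """
SELECT DISTINCT tb.id_a
FROM tab_big tb
JOIN tab_small ts ON ts.id1 = tb.id1 AND ts.id2 = tb.id2
                 AND tb.id_a <> ts.id1 AND tb.id_a <> ts.id2
WHERE ts.id_b = :b
"""

# Global check: id_a must not appear as id1 or id2 in ANY row with this id_b.
GLOBAL = """
SELECT DISTINCT tb.id_a
FROM tab_big tb
JOIN tab_small ts ON ts.id1 = tb.id1 AND ts.id2 = tb.id2
WHERE ts.id_b = :b
  AND tb.id_a NOT IN (SELECT id1 FROM tab_small WHERE id_b = :b)
  AND tb.id_a NOT IN (SELECT id2 FROM tab_small WHERE id_b = :b)
"""

row_local = {r[0] for r in conn.execute(ROW_LOCAL, {"b": 2})}
global_excl = {r[0] for r in conn.execute(GLOBAL, {"b": 2})}

print(sorted(row_local))    # [10, 30, 50]
print(sorted(global_excl))  # [50]
```

With this data the row-local version still returns 10 and 30, even though both appear as id1 in some tab_small row with id_b = 2. Only the NOT IN (or an equivalent EXISTS) form excludes them.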

Okay, then the EXISTS approach suggested by rexem would be right (or the statement by Quassnoi). I'll leave the suggestion in the post for clarity.

IronGoofy 2009-10-09 12:17:13

Answer 3

+2 A:

Create the following indexes:

CREATE INDEX ix_big_1_2_a ON tab_big (id1, id2, id_a);
CREATE UNIQUE INDEX ux_small_b_2_1 ON tab_small (id_b, id2, id1);

and try this:

SELECT  DISTINCT
        a.id_a
FROM    tab_small b
JOIN    tab_big a
ON      (a.id1, a.id2) = (b.id1, b.id2)
WHERE   b.id_b = 2
        AND a.id_a NOT IN
        (
        SELECT  id1
        FROM    tab_small b1 /* FORCE INDEX (PRIMARY) */
        WHERE   b1.id_b = 2
        )
        AND a.id_a NOT IN
        (
        SELECT  id2
        FROM    tab_small b2 /* FORCE INDEX (ux_small_b_2_1) */
        WHERE   b2.id_b = 2
        )

This produces the following query plan:

1, 'PRIMARY', 'b', 'ref', 'PRIMARY,ux_small_b_2_1', 'PRIMARY', '4', 'const', 1, 100.00, 'Using index; Using temporary'
1, 'PRIMARY', 'a', 'ref', 'ix_big_1_2', 'ix_big_1_2', '8', 'test.b.id1,test.b.id2', 2, 100.00, 'Using where'
3, 'DEPENDENT SUBQUERY', 'b2', 'ref', 'ux_small_b_2_1', 'ux_small_b_2_1', '8', 'const,func', 1, 100.00, 'Using index'
2, 'DEPENDENT SUBQUERY', 'b1', 'ref', 'PRIMARY', 'PRIMARY', '8', 'const,func', 1, 100.00, 'Using index'

It is not as efficient as it could be, but I still expect it to be faster than your query.

I commented out the FORCE INDEX hints, but you may need to uncomment them if the optimizer does not pick these indexes.

Everything would be much simpler if MySQL were capable of doing FULL OUTER JOIN using MERGE, but it is not.

Update:

Judging by your statistics, this query will be more efficient:

SELECT  id_a
FROM    (
        SELECT  DISTINCT id_a
        FROM    tab_big ad
        ) a
WHERE   id_a NOT IN
        (
        SELECT  id1
        FROM    tab_small b1 FORCE INDEX (PRIMARY)
        WHERE   b1.id_b = 2
        )
        AND id_a NOT IN
        (
        SELECT  id2
        FROM    tab_small b2 FORCE INDEX (ux_small_b_2_1)
        WHERE   b2.id_b = 2
        )
        AND EXISTS
        (
        SELECT  NULL
        FROM    tab_small be
        JOIN    tab_big ae
        ON      (ae.id1, ae.id2) = (be.id1, be.id2)
        WHERE   be.id_b = 2
                AND ae.id_a = a.id_a
        )

It works as follows:

1. Builds the list of DISTINCT id_a values (which is 100,000 rows long).
2. Filters out the values present in the subset.
3. For each remaining value of id_a, searches the subset for the presence of (id_a, id1, id2). This is done by iterating over the subset. Since the probability of finding this value is high, the search will most probably succeed within 10 rows or so from the beginning of the subset, and EXISTS will return at that very moment.

This will most probably need to evaluate only about 1,000,000 records.
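
The three steps above can be sketched in plain Python. This only models the logic of the query, not how MySQL executes it; the function name candidate_ids and the tuple layouts are made up for illustration, and the toy rows are invented.

```python
def candidate_ids(tab_big, tab_small, id_b):
    """Model the query logic on in-memory rows (hypothetical helper).

    tab_big:   iterable of (id_a, id1, id2) tuples
    tab_small: iterable of (id_b, id1, id2) tuples
    """
    big_rows = set(tab_big)
    # The "subset": (id1, id2) pairs of tab_small rows with the given id_b.
    subset = [(id1, id2) for (b, id1, id2) in tab_small if b == id_b]
    # Every value that appears as id1 or id2 anywhere in the subset.
    excluded = {value for pair in subset for value in pair}

    # Step 1: build the list of distinct id_a values.
    distinct_id_a = {id_a for (id_a, _, _) in big_rows}

    # Step 2: filter out the values present in the subset (the two NOT INs).
    remaining = distinct_id_a - excluded

    # Step 3: EXISTS check; any() stops at the first matching subset row.
    return {
        id_a
        for id_a in remaining
        if any((id_a, id1, id2) in big_rows for (id1, id2) in subset)
    }


# Invented toy data for illustration.
big = [(30, 10, 20), (50, 10, 20), (10, 30, 40), (20, 10, 20)]
small = [(2, 10, 20), (2, 30, 40)]
result = candidate_ids(big, small, 2)
print(result)  # prints {50}
```

The short-circuiting any() plays the role of EXISTS: as soon as one (id_a, id1, id2) combination is found, the remaining subset rows are not inspected for that id_a.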

Make sure that the following plan is used:

1, 'PRIMARY', '<derived2>', 'ALL', '', '', '', '', 8192, 100.00, 'Using where'
5, 'DEPENDENT SUBQUERY', 'be', 'ref', 'PRIMARY,ux_small_b_2_1', 'PRIMARY', '4', 'const', 1, 100.00, 'Using index'
5, 'DEPENDENT SUBQUERY', 'ae', 'eq_ref', 'PRIMARY,ix_big_1_2', 'PRIMARY', '12', 'a.id_a,test.be.id1,test.be.id2', 1, 100.00, 'Using index'
4, 'DEPENDENT SUBQUERY', 'b2', 'ref', 'ux_small_b_2_1', 'ux_small_b_2_1', '8', 'const,func', 1, 100.00, 'Using index'
3, 'DEPENDENT SUBQUERY', 'b1', 'ref', 'PRIMARY', 'PRIMARY', '8', 'const,func', 1, 100.00, 'Using index'
2, 'DERIVED', 'ad', 'range', '', 'PRIMARY', '4', '', 10, 100.00, 'Using index for group-by'

The most important part is `Using index for group-by` in the last row.

Quassnoi 2009-10-09 12:11:49

I don't understand why you'd define the indexes as you suggested. For the join to use the indexes, wouldn't all the columns used in the join have to be indexed, and in the same order as in the join conditions? My feeling is that the statement is slow because of the join, not because of the subqueries!

IronGoofy 2009-10-09 12:22:56

The columns used in the `JOIN` are indexed in `ix_big_1_2_a`. The statement may (or may not) be slow because of the `JOIN`, but we cannot tell it's the real reason until we know how many rows in `tab_big` satisfy the `JOIN` condition.

Quassnoi 2009-10-09 12:30:18
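
As a rough illustration of the point about `ix_big_1_2_a`: a composite index on (id1, id2, id_a) can serve an equality lookup on its leading columns (id1, id2), and id_a comes along in the index. The sketch below checks this with Python's built-in sqlite3 module, which is an assumption made for the sake of a runnable example. SQLite's planner is not MySQL's, so treat it as a demonstration of the leftmost-prefix idea and run EXPLAIN on your own MySQL tables to confirm.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption: illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tab_big (id_a INTEGER, id1 INTEGER, id2 INTEGER);
CREATE INDEX ix_big_1_2_a ON tab_big (id1, id2, id_a);
""")

# Ask the planner how it would resolve a lookup on the two leading index columns.
plan_rows = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id_a FROM tab_big WHERE id1 = ? AND id2 = ?",
    (1, 2),
).fetchall()

# The last column of each plan row holds the human-readable detail text.
plan_text = " ".join(str(row[-1]) for row in plan_rows)
print(plan_text)
```

The exact wording of the plan varies between SQLite versions, but it should mention a search using `ix_big_1_2_a` rather than a full table scan.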

Nice! First of all, the ix_big_1_2_a index makes a huge difference with the original query. Second, the query you suggested works even better. Unfortunately it loses the ORDER BY part from the original query (which is supposed to present the most suitable entries first), but I might be able to cheat around that. Thanks a bunch! I really appreciate it. :)

Mike 2009-10-09 17:09:20


MySQL: optimizing a JOIN query
