views:

1414

answers:

2

We have two tables resembling a simple tag-record structure as follows (in reality it's much more complex but this is the essance of the problem):

tag (A.a) | recordId (A.b)
1         | 1
2         | 1
2         | 2
3         | 2
....

and

recordId (B.b) | recordData (B.c)
1              | 123
2              | 666
3              | 1246

The problem is fetching ordeded records with a specific tag. The obvious way of doing it is with a simple join and indexes on (PK)(A.a, A.b), (A.b), (PK)(B.b), (B.b,B.c) as such:

select A.a, A.b, B.c from A join B on A.b = B.b where a = 44 order by c;

However this gives the unpleasant result of a filesort:

+----+-------------+-------+------+---------------+---------+---------+-----------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key     | key_len | ref       | rows | Extra                                        |
+----+-------------+-------+------+---------------+---------+---------+-----------+------+----------------------------------------------+
|  1 | SIMPLE      | A     | ref  | PRIMARY,b     | PRIMARY | 4       | const     |   94 | Using index; Using temporary; Using filesort | 
|  1 | SIMPLE      | B     | ref  | PRIMARY,b     | b       | 4       | booli.A.b |    1 | Using index                                  | 
+----+-------------+-------+------+---------------+---------+---------+-----------+------+----------------------------------------------+

Using a huge and extremely redundant "materialized view" we can get pretty decent performance but this at the expense of complicating the business-logic, something we would like to avoid, especially since the A and B tables already are MV:s (and are needed for other queries, and infact the same queries using a UNION).

create temporary table C engine=innodb as (select A.a, A.b, B.c from A join B on A.b = B.b);
explain select a, b, c from C where a = 44 order by c;

Further complicating the situation is the fact that we have conditionals on the B-table, such as range-filters.

select A.a, A.b, B.c from A join B on A.b = B.b where a = 44 AND B.c > 678 order by c;

But we are confidant we can handle this if the filesort problem goes away.

Does anyone know why the simple join in codeblock 3 above won't use the index for sorting and if we can get around the problem in some way without creating a new MV?

Below is the full SQL listing that we are using for testing.

DROP TABLE IF EXISTS A;
DROP TABLE IF EXISTS B;
DROP TABLE IF EXISTS C;
CREATE TEMPORARY TABLE A (a INT NOT NULL, b INT NOT NULL, PRIMARY KEY(a, b), INDEX idx_A_b (b)) ENGINE=INNODB;
CREATE TEMPORARY TABLE B (b INT NOT NULL, c INT NOT NULL, d VARCHAR(5000) NOT NULL DEFAULT '', PRIMARY KEY(b), INDEX idx_B_c (c), INDEX idx_B_b (b, c)) ENGINE=INNODB;

DELIMITER $$
CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt <= cnt DO
                INSERT IGNORE INTO A SELECT RAND()*100, RAND()*10000;
                INSERT IGNORE INTO B SELECT RAND()*10000, RAND()*1000, '';
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$
DELIMITER ;

START TRANSACTION;
CALL prc_filler(100000);
COMMIT;
DROP PROCEDURE prc_filler;

CREATE TEMPORARY TABLE C ENGINE=INNODB AS (SELECT A.a, A.b, B.c FROM A JOIN B ON A.b = B.b);
ALTER TABLE C ADD (PRIMARY KEY(a, b), INDEX idx_C_a_c (a, c));

EXPLAIN EXTENDED SELECT A.a, A.b, B.c FROM A JOIN B ON A.b = B.b WHERE A.a = 44;
EXPLAIN EXTENDED SELECT A.a, A.b, B.c FROM A JOIN B ON A.b = B.b WHERE 1 ORDER BY B.c;
EXPLAIN EXTENDED SELECT A.a, A.b, B.c FROM A JOIN B ON A.b = B.b where A.a = 44 ORDER BY B.c;
EXPLAIN EXTENDED SELECT a, b, c FROM C WHERE a = 44 ORDER BY c;
-- Added after Quassnois comments
EXPLAIN EXTENDED SELECT A.a, A.b, B.c FROM  B FORCE INDEX (idx_B_c) JOIN A ON A.b = B.b WHERE A.a = 44 ORDER BY B.c;
EXPLAIN EXTENDED SELECT A.a, A.b, B.c FROM A JOIN B ON A.b = B.b WHERE A.a = 44 ORDER BY B.c LIMIT 10;
EXPLAIN EXTENDED SELECT A.a, A.b, B.c FROM  B FORCE INDEX (idx_B_c) JOIN A ON A.b = B.b WHERE A.a = 44 ORDER BY B.c LIMIT 10;
+3  A: 

When I try to reproduce this query using your scripts:

SELECT  A.a, A.b, B.c
FROM    A
JOIN    B
ON      A.b = B.b
WHERE   a = 44
ORDER BY
        c

, it completes in 0.0043 seconds (instantly), returns 930 rows and yields this plan:

1, 'SIMPLE', 'A', 'ref', 'PRIMARY', 'PRIMARY', '4', 'const', 1610, 'Using index; Using temporary; Using filesort'
1, 'SIMPLE', 'B', 'eq_ref', 'PRIMARY', 'PRIMARY', '4', 'test.A.b', 1, ''

It's quite efficient for such a query.

For such a query, you cannot use a single index both for filtering and sorting.

See this article in my blog for more detailed explanations:

If you expect your query to return few records, you should use the index on A for filtering and then sort using filesort (like the query above does).

If you expect it to return many records (and LIMIT them), you need to use index for sorting and then filter:

CREATE INDEX ix_a_b ON a (b);
CREATE INDEX ix_b_c ON b (c)

SELECT  *
FROM    B FORCE INDEX (ix_b_c)
JOIN    A
ON      A.b = B.b
ORDER BY
        b.c
LIMIT 10;

1, 'SIMPLE', 'B', 'index', '', 'ix_b_c', '4', '', 2, 'Using index'
1, 'SIMPLE', 'A', 'ref', 'ix_a_b', 'ix_a_b', '4', 'test.B.b', 4, 'Using index'
Quassnoi
With the real data the record-table is quite big (both in width and in number of rows, with lots of VARCHAR(255):s) and therefor the temporary table costs more as there is alot more data to copy. On our production db (8-core xeon with everything in memory) the query takes about 0.05-0.1s and a MV-test shows sub 0.01s times.
Paso
I don't get the same query plan as you printed above for the same query. Anyways, the change in ORDER doesnt really help me, sure it removes the filesort but I get the results in the wrong order! Also, just changing the ORDER in the original query to "B.b, B.c" removes the filesort, indicating (well, to me ;)) that it could be possible to do this without a temporary table/filesort. (Funny thing, I actually borrowed the SP for inserting from your blog)
Paso
@Paso: Sorry, didn't understand your task well. Create an index on `b.c` only and change the `ORDER BY` condition. I'll update it in the post now.
Quassnoi
@Paso: I noticed :) Glad to hear you read my blog. One little note: when filling an `InnoDB` table in a procedure, always do it in a transaction, it's way faster.
Quassnoi
@Paso: could you please post which plans do you get when running the queries?
Quassnoi
Now I actually get the same query execution plan as in your post, must have been something funny with the indexing last time I ran it. However the execution time quickly gets pretty bad as the LIMIT increases so I'm not sure it's relevant for me. Infact I'm not sure I can use a LIMIT at all without some major rewrite of the business logic.
Paso
@Paso: again, with this layout you cannot filter **and** sort. You need two things here: filter on `B.b` and sort on `B.c`. An index cannot satisfy both: you cannot build a single range ordered both on `b` and on `c`. If you expecting **few** values, you better filter a range using index on `B.b` and the sort the selected values, since sorting few values is fast. Otherwise select ordered data from the index on `B.c` and then just filter them using index on `A.b`. This will require selecting all values from the index on `B.c` in their order, and without a `LIMIT` it will take constant time.
Quassnoi
Ok, thanks. I guess I will have to go the MV way with the added complexity that means.
Paso
A: 

select A.a, A.b, B.c from A join B on A.b = B.b where a = 44 order by c;

If you alias the columns, does that help? Example:

 SELECT 
 T1.a AS colA, 
 T2.b AS colB, 
 T2.c AS colC 
 FROM A AS T1 
 JOIN B AS T2 
 ON (T1.b = T2.b) 
 WHERE 
 T1.a = 44 
 ORDER BY colC;

The only changes I made were:

  • I put the join conditions in parenthesis
  • The join conditions and where conditions are based on table columns
  • The ORDER BY condition is based on the resulting table column
  • I aliased the result table columns and the queried tables to (hopefully) make it more clear when I was using one or the other (and more clear to the server. You neglect to refer to your columns in two places in your original query).

I know your real data is more complex, but I assume that you provided a simple version of the query because the problem is at that simple level.

Anthony
I'm afraid not, your query gives exactly the same EXPLAIN result.
Paso
Are you actually wanting to join the two tables? What I mean is, do the two tables link up where each row is a full result based on the query, or is more like each row has the data needed from both tables? I ask because if the two tables are not actually tied together in such a way that a join is required, you might consider a UNION instead. With a UNION, the queries are completely independent and thus no sub-queries or temporary tables or anything else taxing needs to happen.
Anthony
I don't really understand. The tables are JOINed over A.b = B.b and I need the data from B for each A matching a condition, how would a UNION help here? For completeness; no I dont need all the data, only the data from B. See the tag-example at the top of the question, that should explain everything as precisely as I can.
Paso