views:

82

answers:

4

I have a mysql query that gets a list of private messages where a user is either the sender, or receiver.

    SELECT 
    users_user1.user_name AS pm_username_1, 
    users_user1.user_avatar AS pm_username_1_avatar,
    users_user2.user_name AS pm_username_2,
    users_user2.user_avatar AS pm_username_2_avatar, 
    pms.*
FROM pm pms
LEFT JOIN users users_user1 
    ON users_user1.user_id = pms.pm_sender
LEFT JOIN users users_user2
    ON users_user2.user_id = pms.pm_receiver
WHERE pm_thread = pm_id 
    AND (pm_receiver = '1' OR pm_sender = '1')
    AND pm_delete != '1'
ORDER by pm_thread_last DESC LIMIT 0, 15

The problem is.... as far as I can tell... it cannot use any index.

Any way I can get around that?

EDIT

+----+-------------+-------------+--------+---------------+---------+---------+------------------------+-------+-----------------------------+
| id | select_type | table       | type   | possible_keys | key     | key_len | ref                    | rows  | Extra                       |
+----+-------------+-------------+--------+---------------+---------+---------+------------------------+-------+-----------------------------+
|  1 | SIMPLE      | pms         | ALL    | pm_receiver   | NULL    | NULL    | NULL                   | 25354 | Using where; Using filesort |
|  1 | SIMPLE      | users_user1 | eq_ref | PRIMARY       | PRIMARY | 4       | movies.pms.pm_sender   |     1 |                             |
|  1 | SIMPLE      | users_user2 | eq_ref | PRIMARY       | PRIMARY | 4       | movies.pms.pm_receiver |     1 |                             |
+----+-------------+-------------+--------+---------------+---------+---------+------------------------+-------+-----------------------------+

Altered the schema to this:

(SELECT 
    users_user1.user_name AS pm_username_1, 
    users_user1.user_avatar AS pm_username_1_avatar,
    users_user2.user_name AS pm_username_2,
    users_user2.user_avatar AS pm_username_2_avatar, 
    pms.*
FROM pm pms
LEFT JOIN users users_user1 
    ON users_user1.user_id = pms.pm_sender
LEFT JOIN users users_user2
    ON users_user2.user_id = pms.pm_receiver
WHERE pm_thread = pm_id 
    AND (pm_receiver = '1')
    AND pm_delete != '1')
UNION
(SELECT 
    users_user1.user_name AS pm_username_1, 
    users_user1.user_avatar AS pm_username_1_avatar,
    users_user2.user_name AS pm_username_2,
    users_user2.user_avatar AS pm_username_2_avatar, 
    pms.*
FROM pm pms
LEFT JOIN users users_user1 
    ON users_user1.user_id = pms.pm_sender
LEFT JOIN users users_user2
    ON users_user2.user_id = pms.pm_receiver
WHERE pm_thread = pm_id 
    AND (pm_sender = '1')
    AND pm_delete != '1')
ORDER by pm_thread_last DESC LIMIT 0, 15

EXPLAIN

+----+--------------+-------------+--------+---------------+-------------+---------+------------------------+------+----------------+
| id | select_type  | table       | type   | possible_keys | key         | key_len | ref                    | rows | Extra          |
+----+--------------+-------------+--------+---------------+-------------+---------+------------------------+------+----------------+
|  1 | PRIMARY      | pms         | ref    | pm_receiver   | pm_receiver | 4       | const                  |  336 | Using where    |
|  1 | PRIMARY      | users_user1 | eq_ref | PRIMARY       | PRIMARY     | 4       | movies.pms.pm_sender   |    1 |                |
|  1 | PRIMARY      | users_user2 | eq_ref | PRIMARY       | PRIMARY     | 4       | movies.pms.pm_receiver |    1 |                |
|  2 | UNION        | pms         | ref    | pm_sender     | pm_sender   | 4       | const                  |  283 | Using where    |
|  2 | UNION        | users_user1 | eq_ref | PRIMARY       | PRIMARY     | 4       | movies.pms.pm_sender   |    1 |                |
|  2 | UNION        | users_user2 | eq_ref | PRIMARY       | PRIMARY     | 4       | movies.pms.pm_receiver |    1 |                |
| NULL | UNION RESULT | <union1,2>  | ALL    | NULL          | NULL        | NULL    | NULL                   | NULL | Using filesort |
+----+--------------+-------------+--------+---------------+-------------+---------+------------------------+------+----------------+
A: 

You can force the issue with index hints, but doing so may not result in better performing queries.

See http://dev.mysql.com/doc/refman/5.0/en/index-hints.html

What index definitions were you using?

David Andres
+1  A: 

Yeah, MySQL can use an index in an OR expression. How do you know its not using your index, did you use EXPLAIN to see how MySQL is running your query? How many rows do you have in that table? If the row count is too small then MySQL wont use an index as its faster to do a full table scan. I think the threshold is 100 - if a table has less than 100 rows than it will always do a table scan versus using an index.

Cody Caughlan
The table has a lot more records than that. I did use explain.... and its not using the index with the OR expression.
Yegor
A: 

Indeed, since it's an OR criteria, MySQL cannot use any index on either of the columns mentioned. That's because an index would allow you to search by one or the other column, but not both at the same time.

I'd suggest splitting the query into two queries so that you don't have to use the OR. And before that - check if this is really what's giving you performance problems. Perhaps you're trying to solve the wrong problem.


Added: After seeing the full query all I can say is - rethink your data structure. This might be pretty good for data integrity or something, but you just can't write a query like this without a full table scan. If you can't restructure it, maybe add another table with the necessary information cached. You'll have to keep the cache up to date though.

Vilx-
A: 

If you think about what the optimizer wants to do, it is quite hard for it to make effective use on the query shown.

When the optimizer reads via an index, it gets the column values for the columns that are indexed, plus information about how many rows contain those values and where to find those rows. Clearly, for a unique index, the 'how many rows' information is 1. There are also, typically, methods for finding index entries for a specific set of row values (all index methods, I think). For some index types, there is a way of finding the first index entry with a partial match for the leading columns of the index (B-tree indexes and relatives). I will assume that the information about where to find the rows is stored as a 'rowid'; the terminology is not completely uniform across DBMS, but will serve. So, an index entry in general identifies the key values and a set of rows where the columns hold the key values.

I propose to ignore the pm_thread = pm_id criterion because it looks like a join criterion. If it is actually a condition between two columns of the sole table in the query, then it is problematic too - not readily searchable via an index.

The other two conditions are:

  1. (pm_receiver = '1337' OR pm_sender = '1337')
  2. pm_delete != '1337'

The second of these is generally very unselective - a not equals condition typically returns almost all the rows in a table, and (standing alone) is best dealt with by a table scan rejecting the few rows that do not match. There can be exceptions to this, and this is why optimizers use statistics. Consider a small business in California; if most of its customers are also in California, then a condition state != 'CA' may be very selective if there are 30,000 customers in CA and 20 outside (but the similar condition state != 'AZ' is very non-selective; it might even select every row from the table, but eliminates at most 20 rows). But without statistics to justify such a contrarian conclusion, an optimizer will assume that the not equals condition is not selective.

This leaves the first condition - an OR clause on two different columns. The individual criteria are likely to be fairly selective; there won't be many rows which match pm_receiver = '1337', and there won't be many rows which match pm_sender = '1337'.

But how could an optimizer use an index to find the rows which satisfy one or the other conditions? If there are two indexes available, one with pm_receiver as the leading column and another with pm_sender as the leading column, then maybe the optimizer could read the 'rowids' for the rows from the first index, and also the 'rowids' for the rows from the second index, and then take the set union of those rowids. It could then proceed with the rest of the query processing. However, it is not automatically clear that using two indexes like that is faster than a table scan, and many optimizers do not do it. They would scan the table, and evaluate the conditions for each row in turn. And they would often be correct to do so - it is the fastest way for them to process the query.

If the optimizer tries to use just one of the indexes - because only one of the indexes is present, perhaps - then it is has a more complex job to do. If the index existed on (pm_receiver, pm_sender), then it could answer the query by scanning the whole index, looking for rows where either pm_receiver is '1337' or pm_sender is '1337'. Whether this is a performance win depends on the size of the columns, the size of the table, and internals of the execution engine. Most DBMS would not try this strategy, especially if they have to refer to the row on disk to complete the query anyway. If all the columns of relevance are contained in the index, scanning just the index might be a winning strategy, but if it needs to go to the disk for data too, then it is probably not a winning strategy.

(If the pm_thread = pm_id criterion is a condition between columns in a single row, it also cannot be evaluated via an index unless the index contains both columns, and also requires a complete index scan to find the rows where the condition applies. And the optimizer would prefer to use an index on the OR condition if that's possible because it would have better selectivity.)

So, given a normal DBMS with tables stored by rows (not a columnar database) and normal indexes, there isn't an easy way for the optimizer to use an index effectively to answer the query - so the optimizer chooses not to bother.


While typing the above, the question was amended to show a multi-way join with two LEFT OUTER JOIN (LOJ) criteria.

LOJ is a performance killer, all by itself. It should be avoided whenever possible. The presence of those makes using indexes much harder. We would need to know the complete schema of each of the tables involved, including the indexes on the tables. Even so, the optimizer will likely scan the dominant table (the one to which the others are left outer joined) and use indexed lookups to find the matching rows - or absence of matching rows - in the outer joined tables.

Jonathan Leffler