views:

56

answers:

3

I am trying to optimize the following query:

select distinct this_.id as y0_
from Rental this_
    left outer join RentalRequest rentalrequ1_ 
      on this_.id=rentalrequ1_.rental_id
    left outer join RentalSegment rentalsegm2_ 
      on rentalrequ1_.id=rentalsegm2_.rentalRequest_id
where
    this_.DTYPE='B'
    and this_.id<=1848978
    and this_.billingStatus=1
    and rentalsegm2_.endDate between 1273631699529 and 1274927699529
order by rentalsegm2_.id asc
limit 0, 100;

This query is done multiple time in a row for paginated processing of records (with a different limit each time). It returns the ids I need in the processing. My problem is that this query take more than 3 seconds. I have about 2 million rows in each of the three tables.

Explain gives:

+----+-------------+--------------+--------+-----------------------------------------------------+---------------+---------+--------------------------------------------+--------+----------------------------------------------+
| id | select_type | table        | type   | possible_keys                                       | key           | key_len | ref                                        | rows   | Extra                                        |
+----+-------------+--------------+--------+-----------------------------------------------------+---------------+---------+--------------------------------------------+--------+----------------------------------------------+
|  1 | SIMPLE      | rentalsegm2_ | range  | index_endDate,fk_rentalRequest_id_BikeRentalSegment | index_endDate | 9       | NULL                                       | 449904 | Using where; Using temporary; Using filesort | 
|  1 | SIMPLE      | rentalrequ1_ | eq_ref | PRIMARY,fk_rental_id_BikeRentalRequest              | PRIMARY       | 8       | solscsm_main.rentalsegm2_.rentalRequest_id |      1 | Using where                                  | 
|  1 | SIMPLE      | this_        | eq_ref | PRIMARY,index_billingStatus                         | PRIMARY       | 8       | solscsm_main.rentalrequ1_.rental_id        |      1 | Using where                                  | 
+----+-------------+--------------+--------+-----------------------------------------------------+---------------+---------+--------------------------------------------+--------+----------------------------------------------+

I tried to remove the distinct and the query ran three times faster. explain without the query gives:

+----+-------------+--------------+--------+-----------------------------------------------------+---------------+---------+--------------------------------------------+--------+-----------------------------+
| id | select_type | table        | type   | possible_keys                                       | key           | key_len | ref                                        | rows   | Extra                       |
+----+-------------+--------------+--------+-----------------------------------------------------+---------------+---------+--------------------------------------------+--------+-----------------------------+
|  1 | SIMPLE      | rentalsegm2_ | range  | index_endDate,fk_rentalRequest_id_BikeRentalSegment | index_endDate | 9       | NULL                                       | 451972 | Using where; Using filesort | 
|  1 | SIMPLE      | rentalrequ1_ | eq_ref | PRIMARY,fk_rental_id_BikeRentalRequest              | PRIMARY       | 8       | solscsm_main.rentalsegm2_.rentalRequest_id |      1 | Using where                 | 
|  1 | SIMPLE      | this_        | eq_ref | PRIMARY,index_billingStatus                         | PRIMARY       | 8       | solscsm_main.rentalrequ1_.rental_id        |      1 | Using where                 | 
+----+-------------+--------------+--------+-----------------------------------------------------+---------------+---------+--------------------------------------------+--------+-----------------------------+

As you can see, the Using temporary is added when using distinct.

I already have an index on all fields used in the where clause. Is there anything I can do to optimize this query?

Thank you very much!

Edit: I tried to order by on this_.id as suggested and the query was 5x slower. Here is the explain plan:

+----+-------------+--------------+------+-----------------------------------------------------+---------------------------------------+---------+------------------------------+--------+----------------------------------------------+
| id | select_type | table        | type | possible_keys                                       | key                                   | key_len | ref                          | rows   | Extra                                        |
+----+-------------+--------------+------+-----------------------------------------------------+---------------------------------------+---------+------------------------------+--------+----------------------------------------------+
|  1 | SIMPLE      | this_        | ref  | PRIMARY,index_billingStatus                         | index_billingStatus                   | 5       | const                        | 782348 | Using where; Using temporary; Using filesort | 
|  1 | SIMPLE      | rentalrequ1_ | ref  | PRIMARY,fk_rental_id_BikeRentalRequest              | fk_rental_id_BikeRentalRequest        | 9       | solscsm_main.this_.id        |      1 | Using where; Using index; Distinct           | 
|  1 | SIMPLE      | rentalsegm2_ | ref  | index_endDate,fk_rentalRequest_id_BikeRentalSegment | fk_rentalRequest_id_BikeRentalSegment | 8       | solscsm_main.rentalrequ1_.id |      1 | Using where; Distinct                        | 
+----+-------------+--------------+------+-----------------------------------------------------+---------------------------------------+---------+------------------------------+--------+----------------------------------------------+
+2  A: 

The reason that the query without the distinct runs faster is because you have a limit clause. Without the distinct, the server only needs to look at the first hundred matches. However however some of those rows may have duplicate fields, so if you introduce the distinct clause, the server has to look at many more rows in order to find ones that do not have duplicate values.

BTW, why are you using OUTER JOIN?

e4c5
@Manuel Darveau: Indeed, why use OUTER JOINs. Have you tried/validated the results using 'normal' joins. Should be much faster!
lexu
+1  A: 

Hi,

Here for "rentalsegm2_" table, optimizer has chosen "index_endDate" index and its no of rows expected from this table is about 4.5 lakhs. Since there are other where conditions exist, you can check for "this_" table indexes . I mean you can check in "this_ table" for how much records affected for each where conditions.

In summary, you can try for alternate solutions by changing indices used by optimizer. This can be obtained by "USE INDEX", "FORCE INDEX" commands.

Thanks

Rinson KE DBA www.qburst.com

RINSON KE
+1  A: 
  1. From the execution plan we see that the optimizer is smart enough to understand that you do not require OUTER JOINs here. Anyway, you should better specify that explicitly.
  2. The DISTINCT modifier means that you want to GROUP BY all fields in SELECT part, that is ORDER BY all of the specified fields and then discard duplicates. In other words, order by rentalsegm2_.id asc clause does not make any sence here.

The query below should return the equivalent result:

select distinct this_.id as y0_
from Rental this_
    join RentalRequest rentalrequ1_ 
      on this_.id=rentalrequ1_.rental_id
    join RentalSegment rentalsegm2_ 
      on rentalrequ1_.id=rentalsegm2_.rentalRequest_id
where
    this_.DTYPE='B'
    and this_.id<=1848978
    and this_.billingStatus=1
    and rentalsegm2_.endDate between 1273631699529 and 1274927699529
limit 0, 100;

UPD

If you want the execution plan to start with RentalSegment, you will need to add the following indices to the database:

  1. RentalSegment (endDate)
  2. RentalRequest (id, rental_id)
  3. Rental (id, DTYPE, billingStatus) or (id, billingStatus, DTYPE)

The query then could be rewritten as the following:

SELECT this_.id as y0_
FROM RentalSegment rs
    JOIN RentalRequest rr
    JOIN Rental this_
WHERE rs.endDate between 1273631699529 and 1274927699529
    AND rs.rentalRequest_id = rr.id
    AND rr.rental_id <= 1848978
    AND rr.rental_id = this_.id
    AND this_.DTYPE='D'
    AND this_.billingStatus = 1
GROUP BY this_.id
LIMIT 0, 100;

If the execution plan will not start from RentalSegment you can force in with STRAIGHT_JOIN.

newtover
I don't understand why the order by does not make sense. Since this is a paginated criteria (I will increase the first param in limit on every call), I must specify an oder by else results will not always be returned in the same order and the pagination will not work. Am I missing something?
Manuel Darveau
@Manuel Darveau: the field `rentalsegm2_.id` is not mentioned in the SELECT part and can potentially be in 1-to-many relation to this._id. When you GROUP BY `this._id`, sorting by `rentalsegm2_.id` does not make sense anymore, since the items by which the sorting is performed can be taken arbitrarily. GROUPing operation (and DISTINCT) assumes implicit sorting by grouped columns (you can even say `GROUP BY this_id DESC` in MySQL) and the results go in that sorted order.
newtover
@Manuel Darveau: I meant that ordering by `rentalsegm2_.id` does not make sence, but explicit ordering by `this_.id asc` does (it is implicit in this case).
newtover
@newtover: I used to order by this_.id but the query took 5 times longer. The problem is that mysql first select on the Rental table, then on RentalRequest and on RentalSegment. The explain plan is the opposite of what I originally posted. I tested your query and took 12 sec (compared to my original query which took 3 sec.)My biggest restriction is on RentalSegment so I guess MySQL should start the query on it.
Manuel Darveau
@Manuel Darveau: I updated my answer.
newtover