views:

95

answers:

5

Hello all, long time lurker, first question!

I am struggling to optimize this query, which selects the lowest priced items that match the chosen filters:

SELECT product_info.*, MIN(product_all.sale_price) as sale_price, product_all.buy_link
FROM product_info
NATURAL JOIN (SELECT * FROM product_all WHERE product_all.date = '2010-09-30') as product_all
WHERE (product_info.category = 2  
AND product_info.gender = 'W' )
GROUP BY product_all.prod_id
ORDER BY MIN(product_all.sale_price) ASC LIMIT 13

Its explain:

| id | select_type | table        | type   | possible_keys                                             | key     | key_len | ref                 | rows   | Extra                           |  
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+  
|  1 | PRIMARY     | <derived2>   | ALL    | NULL                                                     | NULL    | NULL    | NULL                | 89801  | Using temporary; Using filesort | 
|  1 | PRIMARY     | product_info | eq_ref | PRIMARY,category_prod_id_retail_price,category_ret...     | PRIMARY | 4       | product_all.prod_id | 1      | Using where                     | 
|  2 | DERIVED     | product_all  | ref    | date_2                                                    | date_2  | 3       |                     | 144107 |                                 | 

I've tried eliminating the subquery, which intuitively seems better but in practice takes even longer:

SELECT product_info.*, MIN(product_all.sale_price) as sale_price, product_all.buy_link
FROM product_info
NATURAL JOIN product_all
WHERE (product_all.date = '2010-09-30'
AND product_info.category = 2 
AND product_info.gender = 'W' )
GROUP BY product_all.prod_id
ORDER BY MIN(product_all.sale_price) ASC LIMIT 13

And its explain:

| id | select_type | table        | type | possible_keys                                             | key                      | key_len | ref                               | rows | Extra                                        |  
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+  
|  1 | SIMPLE      | product_info | ref  | PRIMARY,category_prod_id_retail_price,category_ret...     | category_retail_price    | 5       | const                             | 269  | Using where; Using temporary; Using filesort | 
|  1 | SIMPLE      | product_all  | ref  | PRIMARY,prod_id,date_2                                    | prod_id                  | 4       | equipster_db.product_info.prod_id | 141  | Using where                                  | 

Here are the tables:

CREATE TABLE `product_all` (
`prod_id` INT( 10 ) NOT NULL PRIMARY KEY ,
`ref_id` INT( 10) NOT NULL PRIMARY KEY ,
`date` DATE NOT NULL ,
`buy_link` BLOB NOT NULL ,
`sale_price` FLOAT NOT NULL
) ENGINE = MYISAM ;


CREATE TABLE `product_info` (
`prod_id` INT( 10 ) NOT NULL AUTO_INCREMENT PRIMARY KEY ,
`prod_name` VARCHAR( 200 ) NOT NULL,
`brand` VARCHAR( 50 ) NOT NULL,
`retail_price` FLOAT NOT NULL
`category` INT( 3 ) NOT NULL,
`gender` VARCHAR( 1 ) NOT NULL,
`type` VARCHAR( 10 ) NOT NULL
) ENGINE = MYISAM ;

My Questions:
-which query structure seems optimal?
-what indices would optimize this query?
-less importantly: how does the indexing approach change when adding or removing WHERE clauses or using a different ORDER BY, such as sorting by % off:

ORDER BY (1-(MIN(product_all.sale_price)/product_info.retail_price)) DESC  

edit: both queries' natural join acts on prod_id (one record in product_info can have multiple instances in product_all, which is why they need to be grouped)

+2  A: 

Indices make a massive difference in mysql, one query that took 15 minutes with a wrong set of indices took .2 seconds with the right ones, but its finding the right balance that is generally the issue. Naturally without some sample data its really hard to say if the below solution will save you any time, but in theory it should.

To answer your questions, I would redesign the tables like so:

CREATE TABLE `product_all` ( 
`prod_id` INT( 10 ) NOT NULL, 
`ref_id` INT( 10) NOT NULL, 
`date` DATE NOT NULL , 
`buy_link` BLOB NOT NULL , 
`sale_price` FLOAT NOT NULL,
PRIMARY KEY (prod_id, ref_id) ,
INDEX date_Index (`date` ASC),
UNIQUE INDEX prod_price_Index (prod_id ASC, sale_price ASC)
) ENGINE = MYISAM ; 


CREATE TABLE `product_info` ( 
`prod_id` INT( 10 ) NOT NULL AUTO_INCREMENT, 
`prod_name` VARCHAR( 200 ) NOT NULL, 
`brand` VARCHAR( 50 ) NOT NULL, 
`retail_price` FLOAT NOT NULL, 
`category` INT( 3 ) NOT NULL, 
`gender` VARCHAR( 1 ) NOT NULL, 
`type` VARCHAR( 10 ) NOT NULL,
PRIMARY KEY (prod_id) ,
UNIQUE INDEX prod_id_name_Index (prod_id ASC, prod_name ASC),
INDEX category_Index (category ASC),
INDEX gender_Index (gender ASC)
) ENGINE = MYISAM ;

SELECT product_info.*, MIN(product_all.sale_price) as sale_price, product_all.buy_link         
FROM product_info         
NATURAL JOIN (SELECT * FROM product_all WHERE product_all.date = '2010-09-30') as product_all         
WHERE (product_info.category = 2           
AND product_info.gender = 'W' )         
GROUP BY product_all.prod_id         
ORDER BY MIN(product_all.sale_price) ASC LIMIT 13        

The performance gain here is gained my indexing the main fields that are being joined upon and are featured in the where clause. Personally I would go with your first query as when you think about it that should perform better.

As far as I understand whats happening in the first and second query:

  • The first query is being filtered by a sub-query prior to doing the natural join, that means its only joining in the resulting data and not the whole table.
  • The second query is joining the entire second table and then filtering the resulting rows of the whole lot back to what you want.

As a rule of thumb normally you want to add indices on your major joining fields and also the fields that you use the most in where clauses. I've also put some unique indices on some of the fields that you will want to query regularly, such as prod_id_name_Index.

If this doesn't improve your performance if you could maybe post some dummy data to play with I might be able to get a faster solution that I can benchmark.

Here is an article that goes through indexing for performance in mysql, worth a read if you want to know more.

Good luck!

EDIT: Your final question I missed the first time, the answer is that if your indexing the main joining fields then changes to the where will only impact the overall performance slightly, but the unique indices I've put on the tables should account for the majority of things you'll want to base queries upon. The main thing to remember is if you query or join upon a field frequently then it should really be indexed, but minor queries and changes to the order by you should just not worry about in terms of realigning your indexing strategy.

JonVD
jon, thanks! those multiple column indices did the trick, and also your edit was spot on, the order by wasn't really dragging the query down since its only operating on 13 rows. cheers!
chrisblanch
A: 

Performance wise, its never a good thing to use

select *

You should use the individual column names instead.

select column1,column2 etc...
cfEngineers
word to that...one of the few things I did know, but figured it's negligible and improves the readability of my question.
chrisblanch
A: 

Personally I'm a sql minimalist and avoid any kind of sub queries or joins that can't be index to index columns.

If that's not really possible I'll probably run the subqueries individually to collect my keys, sort them client site, and then build a where in (...) clause.

JohnVD makes a lot of good points but if you're going to need to make a unique key including product_name you should really see if that can be normalized out into an it.

Indexing varchar columns is something to steer away from at all costs if possible. Each index entry is as big as the maximum size of the column, even if they're usually only a fraction of that. And if you're using a charset like utf-8 then the size is ~ maxlen+3.

With your limit it seems the order by is needed. But just as an FYI when you're doing a group by, if you're going to consume the whole result set then tack on an ORDER BY NULL. Run the two variants through explain to see why; the order by null eliminates an implied filesort and you can sort client side. (This isn't possible if you're doing group by with rollup though)

monitorme
A: 

You should stick with the second query. Use an index on the column which reduces the affected rows the most. In this case it might be the date. if the filter conditions always contain more than one column you should try a multicolumn index. MySQL will only use one index.

mitch
A: 

As Mitch stated, trying to find the criteria that would naturally have the lower count of records would definitely win for performance. And if Category + Gender would be very common, make that an index on BOTH columns. Additionally, once you find that optimimum criteria you might alter the following query to match it better. The "STRAIGHT_JOIN" tells MySQL to do it in the order you state instead of it trying to change the primary table used for querying basis and joining to the other... So, I don't know which is more accurate of index of category, gender or date... If Date is going to have less record basis, then I would swap THAT as the first table in the FROM clause, and mentally move ITs criteria on date to the first position of the WHERE clause (just me personally to keep in synch with tables visually). I've seen STRAIGHT_JOIN improve performance significantly in MANY situations that otherwise appeared to be simple queries.

SELECT STRAIGHT_JOIN
      product_info.*, 
      MIN(product_all.sale_price) as sale_price, 
      product_all.buy_link 
   FROM 
      product_info,
      product_all 
   where 
          product_info.category = 2   
      AND product_info.gender = 'W'
      and product_info.prod_id = product_all.prod_id
      AND product_all.date = '2010-09-30'
   GROUP BY 
      product_info.prod_id 
   ORDER BY 
      MIN(product_all.sale_price) ASC 
   LIMIT 13 
DRapp