Hi all,
I got this query:

SELECT user_id  
FROM basic_info  
WHERE age BETWEEN 18 AND 22 AND gender = 0  
ORDER BY rating  
LIMIT 50

The table looks like (and it contains about 700k rows):

CREATE TABLE IF NOT EXISTS `basic_info` (  
  `user_id` mediumint(8) unsigned NOT NULL auto_increment,  
  `gender` tinyint(1) unsigned NOT NULL default '0',  
  `age` tinyint(2) unsigned NOT NULL default '0',  
  `rating` smallint(5) unsigned NOT NULL default '0',  
  PRIMARY KEY  (`user_id`),  
  KEY `tmp` (`gender`,`rating`)  
) ENGINE=MyISAM;

The query itself is optimized, but it has to walk about 200k rows to do its job. Here's the EXPLAIN output:

id  select_type  table       type  possible_keys  key  key_len  ref    rows    Extra
1   SIMPLE       basic_info  ref   tmp,age        tmp  1        const  200451  Using where

Is it possible to optimize the query so it won't walk over 200k rows?

Thanks !

+1  A: 

Extend your tmp key to include the age column:

KEY `tmp` (`age`,`gender`,`rating`)
Stefan Gehrig
The `rating` column is useless in this index. Since the query has a range condition on the `age` field, results from this index will not be sorted, so there will be a separate sort step no matter what.
intgr
If I extend the key as you propose, the query performance degrades. Here's the EXPLAIN output:
id  select_type  table       type   possible_keys  key  key_len  ref   rows    Extra
1   SIMPLE       basic_info  range  tmp            tmp  2        NULL  107375  Using where; Using filesort
plamen
@intgr: The rating column in the index saves me from the "Using filesort".
plamen
@intgr: You're right... I overlooked the range condition.
Stefan Gehrig
+7  A: 

There are two useful indexes that can help this query:

KEY gender_age (gender, age) -- this index can satisfy both the gender=0 condition as well as age BETWEEN 18 AND 22. However, because you have a range condition over the age field, adding the rating column to the index will not give sorted results -- hence MySQL will select all matching rows -- ignoring your LIMIT clause -- and do an additional filesort regardless.

KEY gender_rating (gender, rating) -- the index you already have; this index can satisfy the gender=0 condition and retrieves data already sorted by rating. However, the database has to scan all rows with gender=0 and eliminate those that are not in the range age BETWEEN 18 AND 22.
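
To compare the two candidates on real data, both indexes can exist side by side and each plan can be inspected with EXPLAIN. This is only a sketch: it assumes the existing `tmp` key plays the role of gender_rating, and it uses FORCE INDEX purely to make the optimizer show each plan in turn.

```sql
-- Add the (gender, age) index; the existing `tmp` key is already (gender, rating).
ALTER TABLE basic_info ADD KEY gender_age (gender, age);

-- Plan A: narrow by gender and age through the index, then filesort by rating.
EXPLAIN SELECT user_id
FROM basic_info FORCE INDEX (gender_age)
WHERE age BETWEEN 18 AND 22 AND gender = 0
ORDER BY rating
LIMIT 50;

-- Plan B: read gender = 0 rows in rating order, filter by age while scanning.
EXPLAIN SELECT user_id
FROM basic_info FORCE INDEX (tmp)
WHERE age BETWEEN 18 AND 22 AND gender = 0
ORDER BY rating
LIMIT 50;
```

Which plan wins depends on how the ages are distributed among the gender=0 rows, so timing both queries on the real table is more reliable than comparing the `rows` estimates alone.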

Changing schema

If the above does not help you enough, changing your schema is always possible. One such approach is turning the age BETWEEN condition into an equality condition, by defining an age group column; for instance, ages 0-12 will be in age group 1, ages 12-18 in age group 2, etc.

This way, an index on (gender, agegroup, rating) and a query with WHERE gender=0 AND agegroup=3 ORDER BY rating will retrieve all results from the index, already sorted. In this case, the LIMIT clause should fetch only 50 entries from the table and no more.
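
A rough sketch of that schema change follows. The column name `agegroup` comes from the text above; the index name and the exact band boundaries are assumptions made for illustration, chosen here so that group 3 covers ages 18 to 22 as in the question.

```sql
-- Hypothetical age-group column; band boundaries are illustrative only.
ALTER TABLE basic_info
  ADD COLUMN agegroup TINYINT UNSIGNED NOT NULL DEFAULT 0;

-- Populate the groups from the existing age column.
UPDATE basic_info
SET agegroup = CASE
  WHEN age < 12  THEN 1
  WHEN age < 18  THEN 2
  WHEN age <= 22 THEN 3   -- ages 18-22, the range used in the question
  ELSE 4
END;

-- Equality columns first, the sort column last.
ALTER TABLE basic_info
  ADD KEY gender_agegroup_rating (gender, agegroup, rating);

-- Both conditions are equalities, so rows come out of the index in rating order.
SELECT user_id
FROM basic_info
WHERE gender = 0 AND agegroup = 3
ORDER BY rating
LIMIT 50;
```

The application would also have to keep `agegroup` in sync whenever `age` changes, which is the price of this approach.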

intgr
+1 for the explanation, especially about the range condition.
Stefan Gehrig
@intgr: Changing the schema sounds reasonable, but it's not possible in my case: what happens if the user asks for all the users between 10 AND 12, or 11 AND 20, or even 10 AND 40?
plamen
If you want to allow queries over arbitrary age ranges then you're right, it does not help. Unfortunately I'm out of ideas for now.
intgr
@intgr: Anyway, +1 from me for the good advice and explanations. Thanks!
plamen
+1  A: 

Have you tried switching to InnoDB to improve performance?

Benchmarking here
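
If switching engines is an option, the conversion itself is a single statement. A sketch, best tried on a copy of the table first:

```sql
-- Check the current storage engine (see the Engine column of the result).
SHOW TABLE STATUS LIKE 'basic_info';

-- Convert the table; this rebuilds it, so it may take a while on ~700k rows.
ALTER TABLE basic_info ENGINE = InnoDB;
```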

Shadi Almosri
@Shadi Almosri: Yep, it's better, but it's "forbidden" to use InnoDB :(
plamen