tags:

views:

306

answers:

3

I'm trying to do the most efficient select on a table with 3 million records.

First some detailed info

Table:

CREATE TABLE IF NOT EXISTS `activities_index` (
  `id` int(9) NOT NULL auto_increment,
  `activity_id` int(6) NOT NULL,
  `activity_status_id` int(2) NOT NULL,
  `activity_source_id` int(6) default NULL,
  `account_id` int(6) default NULL,
  `owner_account_id` int(4) default NULL,
  `date` date NOT NULL,
  `is_event` int(1) NOT NULL,
  `name` varchar(255) collate utf8_unicode_ci NOT NULL,
  `content` longtext collate utf8_unicode_ci,
  `location_name` varchar(255) collate utf8_unicode_ci default NULL,
  `location_content` longtext collate utf8_unicode_ci,
  `meta_keywords` varchar(255) collate utf8_unicode_ci default NULL,
  `thumb_filename` varchar(255) collate utf8_unicode_ci default NULL,
  `popular` int(1) NOT NULL default '0',
  `price` float default NULL,
  `city_id` int(9) default NULL,
  `province_id` int(4) default NULL,
  `country_id` int(4) default NULL,
  `activity_location_id` int(6) NOT NULL,
  `lat` decimal(10,6) default NULL,
  `lng` decimal(10,6) default NULL,
  `activity_modified` datetime default NULL,
  `activity_created` datetime NOT NULL,
  `activity_location_modified` datetime default NULL,
  `activity_location_created` datetime NOT NULL,
  `modified` timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,
  PRIMARY KEY  (`id`),
  KEY `is_event_idx` (`is_event`),
  KEY `activity_id_idx` (`activity_id`),
  KEY `status_city_idx` (`activity_status_id`, `city_id`),
  KEY `date_idx` (`date`),
  FULLTEXT KEY `txt_fields_idx` (`name`,`location_name`,`meta_keywords`)
) ENGINE=MyISAM  DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=14865 ;

Query:

SELECT SQL_NO_CACHE * FROM `activities_index` WHERE 
date BETWEEN '2009-12-23' AND '2010-1-23' AND
activity_status_id = 1 AND
city_id IN ('86', '84', '87', '2381', '453', '137', '1561', '1116', '1614', '2456', '512', '305', '443', '1182', '2229')
ORDER BY date
LIMIT 25

About my index choice:
The main problem is the range select on DATE. Why I'm not using an multiple column index based on what I believe is true, please correct me if I'm wrong: MySQL doesn't use any indices after a range. So the an index (DATE, ACTIVITY_STATUS_ID, CITY_ID) would be useless. An order by on the index table is only correct when using the right prefix. So a mutiple column index(CITY_ID, ACTIVITY_STATUS_ID, DATE) would not give the correct ordered result since we want to order the data on the column DATE.

Explain:
When doing a EXPLAIN on the query the possible_keys order is CITY_IDX, DATE_STATUS_IDX instead I think flipping that order to DATE_IDX, CITY_IDX would be more efficient when ordering by DATE.

id  select_type  table  type  possible_keys  key  key_len  ref  rows  Extra<br />
1  SIMPLE  activities_index  range  city_idx,date_idx  city_idx  5  NULL  1363  Using where; Using filesort

My questions:
How can I flip the order of the possible_keys?
Is there a better way of fixing the issue: an efficient select on a table with 3 million records?
Is my way of thinking the correct one?

A: 

As far as I remember the sql-query-analyzer parses the query from right to left - so the first index he meets is the city-one because it's the rightest. maybe you can flip the indices by changing the position of the in and the between-clause. Do you need all the information from your table? if not you could gain some speed by selecting only the columns you need.

Gambrinus
Ah. That seems not to work. In the application we define specific fields. I thought it would be easier to read if I did not include those fields.
Ruuts
A: 

I'm now thinking of something totally different. Since the city_ids are a result of a base_city + range it would be possible to use only the date plus a algorithm in the where clause for defining the distance of the base_city -> activity. This takes about 0.009 sec to complete. Downside is the usage of that we sometimes still use the city_ids. Hmm.

SQL_NO_CACHE *
FROM `activities_index` AS idx
WHERE 
ROUND(
((acos(sin((52.220818*pi()/180)) * sin(( idx.lat *pi()/180)) + cos((52.220818*pi()/180)) * cos(( idx.lat *pi()/180)) * cos(( (6.891140 -  idx.lng )*pi()/180 )))) 
*180/pi()) *60*1.1515*1.609344
) < 15 AND idx.date BETWEEN '2009-12-23' AND '2010-1-23'
ORDER BY idx.date
LIMIT 25
Ruuts
A: 

Some interesting information on index mergining. Unfortunately, your query is a perfect example of one of the listed deficiencies (a single range scan).

Whether or not the query in your reply is any better depends a lot on how many rows you have in a given date range, because you definitely won't get any optimization out of that algorithm. However, if the date range can narrow the rows sufficiently, that could be the most effective.

Note: the order of the possible_keys in the EXPLAIN output is not significant. Your wording also makes it sound as if you interpret the EXPLAIN output to say that it is doing a range select using date. It is not. It is doing a range select on city_id (it will scan every row with a city_id value between the min and max values in your IN() clause. The efficiency of doing so will depend greatly on the distribution of your values.

Have you tried running ANALYZE TABLE activities_index to see if the speed of the query and/or the output of EXPLAIN changes. MySQL often tries to predict value distributions based on column type, but actually analyzing the table gives a true distribution to use, which can allow it to better select the best key(s).

Rob Van Dam