views:

44

answers:

4

The table looks like this:

    CREATE TABLE `tweet_tweet` (
      `id` int(11) NOT NULL AUTO_INCREMENT,
      `text` varchar(256) NOT NULL,
      `created_at` datetime NOT NULL,
      `created_date` date NOT NULL,
...
      `positive_sentiment` decimal(5,2) DEFAULT NULL,
      `negative_sentiment` decimal(5,2) DEFAULT NULL,
      `entity_id` int(11) DEFAULT NULL,
      PRIMARY KEY (`id`),
      KEY `tweet_tweet_entity_created` (`entity_id`,`created_at`)
    ) ENGINE=MyISAM AUTO_INCREMENT=1097134 DEFAULT CHARSET=utf8

The explain on the query looks like this:

mysql> explain SELECT `tweet_tweet`.`entity_id`, 
       STDDEV_POP(`tweet_tweet`.`positive_sentiment`) AS `sentiment_stddev`, 
       AVG(`tweet_tweet`.`positive_sentiment`) AS `sentiment_avg`, 
       COUNT(`tweet_tweet`.`id`) AS `tweet_count` 
       FROM `tweet_tweet` 
       WHERE `tweet_tweet`.`created_at` > '2010-10-06 16:24:43'  
       GROUP BY `tweet_tweet`.`entity_id` ORDER BY `tweet_tweet`.`entity_id` ASC;

+----+-------------+-------------+------+---------------+------+---------+------+---------+----------------------------------------------+
| id | select_type | table       | type | possible_keys | key  | key_len | ref  | rows    | Extra                                        |
+----+-------------+-------------+------+---------------+------+---------+------+---------+----------------------------------------------+
|  1 | SIMPLE      | tweet_tweet | ALL  | NULL          | NULL | NULL    | NULL | 1097452 | Using where; Using temporary; Using filesort |
+----+-------------+-------------+------+---------------+------+---------+------+---------+----------------------------------------------+
  1 row in set (0.00 sec)

About 300k rows are added to the table every day. The query runs about 4 seconds right now but I want to get it down to around 1 second and I'm afraid the query will take exponentially longer as the days go on. Total number of rows in tweet_tweet is currently only a little over 1M, but it will be growing fast.

Any thoughts on optimizing this? Do I need any more indexes? Should I be using something like Cassandra instead of MySQL? =)

A: 

You're not using the index tweet_tweet_entity_created. Change your query to:

explain SELECT `tweet_tweet`.`entity_id`, 
       STDDEV_POP(`tweet_tweet`.`positive_sentiment`) AS `sentiment_stddev`, 
       AVG(`tweet_tweet`.`positive_sentiment`) AS `sentiment_avg`, 
       COUNT(`tweet_tweet`.`id`) AS `tweet_count` 
       FROM `tweet_tweet` FORCE INDEX (tweet_tweet_entity_created)
       WHERE `tweet_tweet`.`created_at` > '2010-10-06 16:24:43'  
       GROUP BY `tweet_tweet`.`entity_id` ORDER BY `tweet_tweet`.`entity_id` ASC;

You can read more about index hints in the MySQL manual http://dev.mysql.com/doc/refman/5.1/en/index-hints.html

Sometimes MySQL's query optimizer needs a little help.

William
A: 

MySQL has a dirty little secret. When you create an index over multiple columns, only the first one is really "used". I've made tables that used Unique Keys and Foreign Keys, and I often had to set a separate index for one or more of the columns.

I suggest adding an extra index to just created_at at a minimum. I do not know if adding indexes to the aggregate columns will also speed things up.

Wolfman2000
Sometimes you need to let the query optimizer know what index to use.
William
@Wolfman2000: This is like saying that in a telephone book, only people's last names are used. Of course this is not true. First names are in there too, but the book is ordered is by last name first. So if you want to search for everyone with a given first name, the index doesn't help.
Bill Karwin
Guess I misunderstood the question. Sorry about that.
Wolfman2000
A: 

You may try to reorder fields in the index (i.e. KEY tweet_tweet_entity_created (created_at, entity_id). That will allow mysql to use the index to reduce the quantity of actual rows that need to be grouped and ordered).

cryo28
Wolfmann2000 in his answer said that mysql uses only the first key of the multicolumn index. Actually it is not really correct. It is up to the structure of B-Tree indices. If it makes sense for a particular query to use both keys from the multicolumn index, mysql will use both. If it is not - it will use only the first one. However, you can't use the second column of a multicolumn index from the query which doesn't need the first one. So when you create multicolumn indices you should always think about the order of columns in a multicolumn index you are going to create.
cryo28
That did it! The query now takes 0.73 seconds!
A: 

if your mysql version 5.1 or higher ,you can consider partitioning option for large tables.

http://dev.mysql.com/doc/refman/5.1/en/partitioning.html

Baris Akverdi