I'm trying to 'bin' all of my customers based on the number of items purchased, and display the counts of each bin. I'm trying to see how many people (account_id) purchased one item, how many purchased two items, all the way through nine items, and then ten or more.

Here's the query I'm using. For what it's worth, I'd expect the query to do a full table scan on sales in order to generate the results, but the whole process takes forever!

I'm coming from an Oracle background and I wrote the query as I would in Oracle.

 select  thecnt
      ,  count(*) 
   from  (select  count(*)
               ,  case when count(*) >= 10 then 'tenormore' else cast(count(*) as char) end thecnt
            from  sales
           where  created >= SUBDATE( CURRENT_DATE(), INTERVAL 60 DAY )
        group by  account_id) sub
group by  thecnt
order by  thecnt;

Are there any gotchas in MySQL when dealing with subqueries?

Explain plan:

+----+-------------+-------------------+-------+---------------+---------+---------+------+---------+----------+-----------------------------------------------------------+
| id | select_type | table             | type  | possible_keys | key     | key_len | ref  | rows    | filtered | Extra                                                     |
+----+-------------+-------------------+-------+---------------+---------+---------+------+---------+----------+-----------------------------------------------------------+
|  1 | PRIMARY     | <derived2>        | ALL   | NULL          | NULL    | NULL    | NULL | 2143248 |   100.00 | Using temporary; Using filesort                           |
|  2 | DERIVED     | sales             | range | created       | created | 8       | NULL | 2012492 |   100.00 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+-------------------+-------+---------------+---------+---------+------+---------+----------+-----------------------------------------------------------+
2 rows in set, 1 warning (1 hour 4 min 6.14 sec)


mysql> describe sales;
+-----------------+---------------------+------+-----+---------+-------+
| Field           | Type                | Null | Key | Default | Extra |
+-----------------+---------------------+------+-----+---------+-------+
| account_id      | char(36)            | NO   | PRI | NULL    |       |
| created         | datetime            | NO   | MUL | NULL    |       |
| histogram_value | bigint(20) unsigned | NO   | PRI | NULL    |       |
+-----------------+---------------------+------+-----+---------+-------+
+1  A: 

You are probably missing the proper indexes.

EDIT:

Your query is slow because the subquery's result doesn't fit into memory, so a temporary table on disk is being used.

So you would benefit from an index on (account_id, created), which, if used, prevents the subquery from needing a temporary table on disk:

ALTER TABLE sales ADD INDEX ix_acc_cre (account_id, created);
Imre L
Why would I want to add an index if I plan on scanning the entire table? Figure that there are 70 days' worth of data in the table and I'm interested in the most recent 60. I should have been more specific when I said I plan on a full table scan. Unless MySQL uses different access paths compared to Oracle...
Neil Kodner
Right now MySQL has to do lots of sorting, and since indexes are sorted this can be avoided. If you **insist** on not using indexes, then change your query so that `from sales` becomes `from sales use index ()` (see the sketch below); this effectively disables the use of any indexes.
Imre L
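
For reference, here is a sketch of that hint applied to the original query; `USE INDEX ()` with an empty index list is standard MySQL syntax for telling the optimizer to ignore every index on the table (everything else is unchanged from the question):

 select  thecnt
      ,  count(*)
   from  (select  count(*)
               ,  case when count(*) >= 10 then 'tenormore' else cast(count(*) as char) end thecnt
            from  sales use index ()
           where  created >= SUBDATE( CURRENT_DATE(), INTERVAL 60 DAY )
        group by  account_id) sub
group by  thecnt
order by  thecnt;
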
+1  A: 

I don't see anything particularly wrong with your query. The reason the query is slow is that it needs to use temporary tables and filesort. The only way to seriously speed up this query will be to modify your MySQL settings to allocate more memory, so as to avoid using the disk for these processes. Here's a spot-on article covering the pertinent settings.


Edit: Once you do this, you can also save memory by specifying an exact column to count instead of COUNT(*), and by making a few other minor tweaks, as some of the others have mentioned. You want to keep the data set as small as possible to make the most of your memory. But I think the overall issue won't go away unless you allocate more memory.

wuputah
OK, I'll change my query to count the # of orders instead of the whole row. I need to look at how MySQL optimizes queries a little more closely. With non-privileged access, is there any way I can check the memory settings via MySQL's data dictionary? I don't have a shell account on the box.
Neil Kodner
You can use `SHOW GLOBAL VARIABLES` to view the current settings.
wuputah
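
For example, here are a few `SHOW GLOBAL VARIABLES` queries a non-privileged user can run; which variables actually need tuning here is an assumption on my part, but these are the ones that typically govern in-memory temporary tables and sorting:

SHOW GLOBAL VARIABLES LIKE 'tmp_table_size';
SHOW GLOBAL VARIABLES LIKE 'max_heap_table_size';
SHOW GLOBAL VARIABLES LIKE 'sort_buffer_size';
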
A: 

An index can be useful in a full table scan if MySQL can extract the data out of the index instead of looking at the actual rows. You shouldn't need the subquery here:

SELECT COUNT(account_id) AS thecnt,
       IF(COUNT(account_id) < 10, COUNT(account_id), 'tenormore')
  FROM sales
 WHERE created >= SUBDATE( CURRENT_DATE(), INTERVAL 60 DAY )
 GROUP BY account_id
 ORDER BY thecnt DESC

Hope this helps.

Joshua Martell
If your primary key starts with account_id (account_id, created, ...), the on-disk ordering of the data will be close to the grouping the aggregate needs, and the filesort won't have as much to do. You might also consider using triggers to keep a summary table up to date with order counts (a sketch follows below).
Joshua Martell
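
As a rough sketch of that summary-table idea (the `sales_counts` table and trigger names are hypothetical, and a real version would also have to handle deletes and the 60-day window):

CREATE TABLE sales_counts (
  account_id      char(36)     NOT NULL PRIMARY KEY,
  items_purchased int unsigned NOT NULL DEFAULT 0
);

DELIMITER //
CREATE TRIGGER trg_sales_counts AFTER INSERT ON sales
FOR EACH ROW
BEGIN
  -- bump the running count for this account on every new sales row
  INSERT INTO sales_counts (account_id, items_purchased)
  VALUES (NEW.account_id, 1)
  ON DUPLICATE KEY UPDATE items_purchased = items_purchased + 1;
END//
DELIMITER ;

The histogram query could then read from sales_counts instead of scanning sales (ignoring the 60-day filter for the sake of the sketch).
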
His original outer query is grouping by count. It answers the business question, "How many customers ordered (1, 2, ..., 10 or more) items in the last 60 days?" This is different from "How many items did each customer order in the last 60 days?" which is what the inner query (and your query) returns.
wuputah
Wuputah is exactly right: I'm trying to group by the counts in order to build the histogram. Thanks, however, for taking a shot at the question!
Neil Kodner