I would encourage you to learn how to use EXPLAIN to analyze the database's plan for query optimization. Also see Baron Schwartz's presentation EXPLAIN Demystified (a link to the PDF of his slides is on that page).
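As a minimal sketch of what that looks like, you prefix the statement with EXPLAIN. The GROUP BY query here is only my assumption about the shape of your slow query, based on the conditions discussed in this answer; substitute your real statement.

```sql
-- Assumed shape of the slow query; replace with your actual statement.
EXPLAIN
SELECT COUNT(*), WEB_STATUS
FROM RAW_LOG_20100504
WHERE CP_FLAG > 0
GROUP BY WEB_STATUS;
```

In the report, the key column names the index chosen (if any), and the Extra column carries notes such as Using temporary or Using filesort.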
Learn how to create indexes -- this is not the same thing as a primary key or an auto-increment pseudokey. See the presentation More Mastering the Art of Indexing by Yoshinori Matsunobu.
Your table could use an index on CP_FLAG and WEB_STATUS.
CREATE INDEX CW ON RAW_LOG_20100504 (CP_FLAG, WEB_STATUS);
This helps to look up the subset of rows based on your cp_flag condition.
Then you still run into MySQL's unfortunate inefficiency with GROUP BY queries. It copies an interim result set into a temporary file on disk and sorts it there. Disk I/O tends to kill performance.
You can raise your sort_buffer_size configuration parameter until it's large enough that MySQL can sort the result set in memory instead of on disk. But that might not work.
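A rough sketch of trying this for one connection before touching the server configuration. The 16 MB figure is an arbitrary illustrative value, not a recommendation.

```sql
-- Check the current value (reported in bytes).
SHOW VARIABLES LIKE 'sort_buffer_size';

-- Raise it for the current connection only.
-- 16777216 bytes (16 MB) is an assumed test value; tune it for your data.
SET SESSION sort_buffer_size = 16777216;
```

Then rerun the query in the same session and compare timings. To make a value permanent, set sort_buffer_size in the [mysqld] section of your MySQL configuration file.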
You might have to resort to precalculating the COUNT() you need and updating this statistic periodically.
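One possible way to precalculate is a small summary table that a scheduled job refreshes. The table name WEB_STATUS_COUNTS, its column names, and the column types are all hypothetical; I'm also assuming WEB_STATUS is an integer and never NULL.

```sql
-- Hypothetical summary table; names and types are assumptions.
CREATE TABLE WEB_STATUS_COUNTS (
  WEB_STATUS INT NOT NULL PRIMARY KEY,
  HIT_COUNT  BIGINT NOT NULL
);

-- Run periodically (for example from a cron job) to refresh the counts.
-- REPLACE overwrites the existing row for each WEB_STATUS value.
REPLACE INTO WEB_STATUS_COUNTS (WEB_STATUS, HIT_COUNT)
SELECT WEB_STATUS, COUNT(*)
FROM RAW_LOG_20100504
WHERE CP_FLAG > 0
GROUP BY WEB_STATUS;

-- Reports read the small summary table instead of the raw log.
SELECT HIT_COUNT, WEB_STATUS
FROM WEB_STATUS_COUNTS
ORDER BY HIT_COUNT DESC;
```

The refresh still pays the cost of the GROUP BY, but only once per refresh instead of once per report.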
The comment from @Marcus gave me another idea. You're grouping by web status, and the set of distinct values of web status is a fairly short list and they don't change. So you could run a separate query for each distinct value and generate the results you need much faster than by using a GROUP BY query that creates a temp table to do the sorting. Or you could run a subquery for each status value, and UNION them together:
(SELECT COUNT(*), WEB_STATUS FROM RAW_LOG_20100504 WHERE CP_FLAG > 0 AND WEB_STATUS = 200)
UNION
(SELECT COUNT(*), WEB_STATUS FROM RAW_LOG_20100504 WHERE CP_FLAG > 0 AND WEB_STATUS = 404)
UNION
(SELECT COUNT(*), WEB_STATUS FROM RAW_LOG_20100504 WHERE CP_FLAG > 0 AND WEB_STATUS = 304)
UNION
...etc...
ORDER BY 1 DESC;
Because your covering index includes CP_FLAG and WEB_STATUS, these queries never need to read the actual rows in the table. They only read entries in the index, which they can access much faster because (a) they're in a sorted tree, and (b) they may be cached in memory if you allocate enough to your key_buffer_size.
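A quick way to check whether the index is being served from memory, sketched under the assumption that the table uses the MyISAM storage engine (key_buffer_size applies to MyISAM indexes; for InnoDB the corresponding setting is innodb_buffer_pool_size):

```sql
-- Current size of the MyISAM key cache, in bytes.
SHOW VARIABLES LIKE 'key_buffer_size';

-- Key_read_requests: requests to read an index block from the key cache.
-- Key_reads: reads of an index block that had to go to disk.
SHOW GLOBAL STATUS LIKE 'Key_read%';
```

If Key_reads grows quickly relative to Key_read_requests while you run these queries, the cache is probably too small to hold the index.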
The EXPLAIN report from my test (with 1M rows of test data) shows that this uses the index well and does not create a temp table:
+------+--------------+------------------+------+--------------------------+
| id | select_type | table | key | Extra |
+------+--------------+------------------+------+--------------------------+
| 1 | PRIMARY | RAW_LOG_20100504 | CW | Using where; Using index |
| 2 | UNION | RAW_LOG_20100504 | CW | Using where; Using index |
| 3 | UNION | RAW_LOG_20100504 | CW | Using where; Using index |
| NULL | UNION RESULT | <union1,2,3> | NULL | Using filesort |
+------+--------------+------------------+------+--------------------------+
The Using filesort on the last line just means it has to sort without the benefit of an index. But sorting the three rows produced by the subqueries is trivial and MySQL does it in memory.
When designing optimal database solutions, there are rarely simple answers. A lot depends on how you use the data and what kind of queries are of higher priority to make fast. If there were a single, simple answer that worked in all circumstances, the software would just enable that design by default and you wouldn't have to do anything.
You really need to read a lot of manuals, books and blogs to understand how to take the most advantage of all the features available to you.
Yes, I would still recommend using indexes. Clearly it was not working before, when you were querying 100 million rows without the benefit of an index.
You have to understand that you must design indexes that benefit the specific query you want to run. I have no way of knowing if the index you just described in your comment is appropriate, because you haven't shown the other query you're trying to speed up.
Indexing is a complex topic. If you define the index on the wrong columns, or if you get the columns in the wrong order, it may not be usable by a given query. I've been supporting SQL developers since 1994, and I've never found a single, concise rule to explain how to design indexes.
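One way to see the effect of column order for yourself is to build a second index with the same columns reversed and compare the EXPLAIN reports. The index name WC is hypothetical, and which plan wins depends on your data, so treat this as an experiment on a test copy rather than a recommendation.

```sql
-- Hypothetical second index: same columns as CW, opposite order.
CREATE INDEX WC ON RAW_LOG_20100504 (WEB_STATUS, CP_FLAG);

-- Force each index in turn and compare the key, rows, and Extra columns.
EXPLAIN SELECT COUNT(*)
FROM RAW_LOG_20100504 FORCE INDEX (CW)
WHERE CP_FLAG > 0 AND WEB_STATUS = 404;

EXPLAIN SELECT COUNT(*)
FROM RAW_LOG_20100504 FORCE INDEX (WC)
WHERE CP_FLAG > 0 AND WEB_STATUS = 404;

-- Drop the experimental index when you're done.
DROP INDEX WC ON RAW_LOG_20100504;
```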
You seem like you need a mentor, because you're at a stage where you need a lot of questions answered. Is there someone where you work that you could ask to help you?