Hi all, I am running SQL queries on a MySQL table that holds 110M+ unique records covering a whole day.

Problem: Whenever I run any query with a WHERE clause, it takes at least 30-40 minutes. Since I need to generate most of the data on the next day, I need access to the whole table.

Could you please guide me on how to optimize or restructure this deployment?

Site description:

mysql  Ver 14.12 Distrib 5.0.24, for pc-linux-gnu (i686) using readline 5.0
4 GB RAM
Dual-core, dual-CPU, 3 GHz
RHEL 3

my.cnf contents :

[mysqld]
datadir=/data/mysql/data/
socket=/tmp/mysql.sock

sort_buffer_size = 2000000
table_cache = 1024
key_buffer = 128M
myisam_sort_buffer_size = 64M

# Default to using old password format for compatibility with mysql 3.x
# clients (those using the mysqlclient10 compatibility package).
old_passwords=1

[mysql.server]
user=mysql
basedir=/data/mysql/data/

[mysqld_safe]
err-log=/data/mysql/data/mysqld.log
pid-file=/data/mysql/data/mysqld.pid

DB table details:

CREATE TABLE `RAW_LOG_20100504` (
  `DT` date default NULL,
  `GATEWAY` varchar(15) default NULL,
  `USER` bigint(12) default NULL,
  `CACHE` varchar(12) default NULL,
  `TIMESTAMP` varchar(30) default NULL,
  `URL` varchar(60) default NULL,
  `VERSION` varchar(6) default NULL,
  `PROTOCOL` varchar(6) default NULL,
  `WEB_STATUS` int(5) default NULL,
  `BYTES_RETURNED` int(10) default NULL,
  `RTT` int(5) default NULL,
  `UA` varchar(100) default NULL,
  `REQ_SIZE` int(6) default NULL,
  `CONTENT_TYPE` varchar(50) default NULL,
  `CUST_TYPE` int(1) default NULL,
  `DEL_STATUS_DEVICE` int(1) default NULL,
  `IP` varchar(16) default NULL,
  `CP_FLAG` int(1) default NULL,
  `USER_LOCATE` bigint(15) default NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1 MAX_ROWS=200000000;

Thanks in advance! Regards,

+2  A: 

Add an index to any field that appears in your WHERE clause. Primary keys must be unique, and unique indexes must be unique, but uniqueness is not a prerequisite for an ordinary index.

Badly defined or non-existent indexes are one of the primary causes of poor performance, and fixing them can often lead to phenomenal improvements.

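For example, if your reports filter on DT and WEB_STATUS (a guess on my part; substitute whichever columns actually appear in your WHERE clauses), an index like this is a starting point:

CREATE INDEX IDX_DT_STATUS ON RAW_LOG_20100504 (DT, WEB_STATUS);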

Frank V
@Frank: Thanks for the response. Do you feel changing the engine from MyISAM to InnoDB would help? The only reason for choosing MyISAM was to support 100M+ records in a single table. I would prefer to run more than one simultaneous query on the same table without impacting the other ongoing queries.
Vijay Gharge
I'm not entirely sure about changing the engine. I use MyISAM myself, but I don't have any tables in a MySQL database anywhere near the size of yours, so I'm not the best person to answer that question. That said, I'm sure you'd see an improvement by simply adding a few indexes.
Frank V
@Vijay: In the future, when you address a specific person, please add comments to the respective answer. I almost didn't notice you were asking me something.
Bill Karwin
Switching the table from MyISAM to InnoDB will indeed let you run simultaneous `SELECT` statements. This is also possible for MyISAM tables, but the conditions that permit that are more complex and rarely met.
staticsan
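(A conversion is a single statement, sketched below against the table from the question; note that it rewrites the entire table, which on 100M+ rows can take hours, so test on a copy first.)

ALTER TABLE RAW_LOG_20100504 ENGINE=InnoDB;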
@staticsan: Thanks for the response. Please note that I am referring to only one table with 100M+ unique records. Please let me know your feedback.
Vijay Gharge
+6  A: 

I would encourage you to learn how to use EXPLAIN to analyze the database's plan for query optimization. Also see Baron Schwartz' presentation EXPLAIN Demystified (link to PDF of his slides is on that page).
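For instance, prefixing the slow query with EXPLAIN shows which indexes are chosen and whether a temporary table or filesort is involved (the GROUP BY below is my reconstruction of the query under discussion):

EXPLAIN SELECT COUNT(*), WEB_STATUS
FROM RAW_LOG_20100504
WHERE CP_FLAG > 0
GROUP BY WEB_STATUS;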

Learn how to create indexes -- this is not the same thing as a primary key or an auto-increment pseudokey. See the presentation More Mastering the Art of Indexing by Yoshinori Matsunobu.

Your table could use an index on CP_FLAG and WEB_STATUS.

CREATE INDEX CW ON RAW_LOG_20100504 (CP_FLAG, WEB_STATUS);

This helps to look up the subset of rows based on your cp_flag condition.

Then you still run into MySQL's unfortunate inefficiency with GROUP BY queries. It copies an interim result set into a temporary file on disk and sorts it there. Disk I/O tends to kill performance.

You can raise your sort_buffer_size configuration parameter until it's large enough that MySQL can sort the result set in memory instead of on disk. But that might not work.
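For example (the 256M figure below is only an illustration, not a recommendation; remember that sort_buffer_size is allocated per connection):

SET SESSION sort_buffer_size = 268435456; -- 256M, for this session only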

You might have to resort to precalculating the COUNT() you need, and update this statistic periodically.
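A minimal sketch of that approach, using a hypothetical summary table (the names are mine, not anything standard):

-- Hypothetical summary table, rebuilt periodically (e.g. from cron).
CREATE TABLE STATUS_COUNTS (
  WEB_STATUS INT,
  CNT BIGINT
);

-- Run the expensive aggregation once, off-peak, so reports can
-- read the small summary table instead of scanning 100M+ rows.
DELETE FROM STATUS_COUNTS;
INSERT INTO STATUS_COUNTS
  SELECT WEB_STATUS, COUNT(*)
  FROM RAW_LOG_20100504
  WHERE CP_FLAG > 0
  GROUP BY WEB_STATUS;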


The comment from @Marcus gave me another idea. You're grouping by web status, and the set of distinct values of web status is a fairly short list and they don't change. So you could run a separate query for each distinct value and generate the results you need much faster than by using a GROUP BY query that creates a temp table to do the sorting. Or you could run a subquery for each status value, and UNION them together:

(SELECT COUNT(*), WEB_STATUS FROM RAW_LOG_20100504 WHERE CP_FLAG > 0 AND WEB_STATUS = 200)
UNION
(SELECT COUNT(*), WEB_STATUS FROM RAW_LOG_20100504 WHERE CP_FLAG > 0 AND WEB_STATUS = 404)
UNION
(SELECT COUNT(*), WEB_STATUS FROM RAW_LOG_20100504 WHERE CP_FLAG > 0 AND WEB_STATUS = 304)
UNION
...etc...
ORDER BY 1 DESC;

Because your covering index includes CP_FLAG and WEB_STATUS, these queries never need to read the actual rows in the table. They only read entries in the index, which they can access much faster because (a) they're in a sorted tree, and (b) they may be cached in memory if you allocate enough to your key_buffer_size.
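(key_buffer_size can be raised at runtime; 512M below is only an illustration for a 4 GB machine, and you would also update key_buffer in my.cnf so the change persists across restarts.)

SET GLOBAL key_buffer_size = 536870912; -- 512M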

The EXPLAIN report I tried (with 1M rows of test data) shows that this uses indexes well, and does not create a temp table:

+------+--------------+------------------+------+--------------------------+
| id   | select_type  | table            | key  | Extra                    |
+------+--------------+------------------+------+--------------------------+
|  1   | PRIMARY      | RAW_LOG_20100504 | CW   | Using where; Using index |
|  2   | UNION        | RAW_LOG_20100504 | CW   | Using where; Using index |
|  3   | UNION        | RAW_LOG_20100504 | CW   | Using where; Using index |
| NULL | UNION RESULT | <union1,2,3>     | NULL | Using filesort           |
+------+--------------+------------------+------+--------------------------+

The Using filesort for the last line just means it has to sort without the benefit of an index. But sorting the three rows produced by the subqueries is trivial and MySQL does it in memory.


When designing optimal database solutions, there are rarely simple answers. A lot depends on how you use the data and what kind of queries are of higher priority to make fast. If there were a single, simple answer that worked in all circumstances, the software would just enable that design by default and you wouldn't have to do anything.

You really need to read a lot of manuals, books and blogs to understand how to take most advantage of all the features available to you.


Yes, I would still recommend using indexes. Clearly it was not working before, when you were querying 100 million rows without the benefit of an index.

You have to understand that you must design indexes that benefit the specific query you want to run. I have no way of knowing if the index you just described in your comment is appropriate, because you haven't shown the other query you're trying to speed up.

Indexing is a complex topic. If you define the index on the wrong columns, or if you get the columns in the wrong order, it may not be usable by a given query. I've been supporting SQL developers since 1994, and I've never found a single, concise rule to explain how to design indexes.
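One concrete illustration of the column-order point, using the CW index created above (the queries themselves are hypothetical):

-- CW (CP_FLAG, WEB_STATUS) supports lookups on a leftmost prefix:
SELECT COUNT(*) FROM RAW_LOG_20100504 WHERE CP_FLAG = 1;
SELECT COUNT(*) FROM RAW_LOG_20100504 WHERE CP_FLAG = 1 AND WEB_STATUS = 200;
-- But it cannot seek on WEB_STATUS alone, because it is not the leftmost column:
SELECT COUNT(*) FROM RAW_LOG_20100504 WHERE WEB_STATUS = 200;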

It sounds like you need a mentor, because you're at a stage where you need a lot of questions answered. Is there someone where you work whom you could ask for help?

Bill Karwin
@Bill, does COUNT(*) use the covering index?
Marcus Adams
@Bill: Sorry for that. I got lost because of the many "Add comment" options. I tried `CREATE INDEX USER ON RAW_LOG_20100503 (MSISDN, BYTES_RETURNED, REQ_SIZE);` but even after 41614 seconds it had not finished, and I had to abort the query with Ctrl+C. Would you still recommend using an index? It looks like indexes on 100M+ records do not give the best performance. One more thing: I will be using a new table every day. How will indexing work in that situation?
Vijay Gharge
@Bill: I understand that there won't be a "one solution fits all" approach. But I want to understand how we can optimize MySQL in my case. Hence, I request your valuable views/feedback to help me improve MySQL performance.
Vijay Gharge