views: 2300
answers: 10

How to speed up select count(*) with group by?
It's too slow and it's used very frequently.
I'm having big trouble using SELECT COUNT(*) with GROUP BY on a table that has more than 3,000,000 rows.

select object_title,count(*) as hot_num   
from  relations 
where relation_title='XXXX'   
group by object_title

relation_title and object_title are both varchar. The condition relation_title='XXXX' matches more than 1,000,000 rows, so the index on object_title cannot be used effectively.

A: 

There is a point at which you truly need more RAM/CPUs/IO. You may have hit that point for your hardware.

I will note that it usually isn't effective to use indexes (unless they are covering indexes) for queries that hit more than 1-2% of the total rows in a table. If your large query is doing index seeks and bookmark lookups, it could be because of a cached plan generated for a query that covered just a single day's worth of data. Try adding WITH (INDEX=0) to force a table scan and see if it is faster.

Taken from: http://www.microsoft.com/communities/newsgroups/en-us/default.aspx?dg=microsoft.public.sqlserver.programming&tid=4631bab4-0104-47aa-b548-e8428073b6e6&cat=&lang=&cr=&sloc=&p=1

Haim Evgi
I thought this was MS SQL to start with but the poster has added the mysql tag...
Joel Mansford
Note that the question is tagged "mysql" not "mssql".
Kosi2801
Yes, 'mysql'. I have tried "force index(primary)" to stop MySQL from picking the index on its own. It's effective: from 20s down to 15s.
ZA
A: 

If you want the size of the whole table, you should query the metadata tables or the information schema (which exist in every DBMS I know of, though I'm not sure about MySQL). If your query is selective, you have to make sure there is an index for it.

AFAIK there is nothing more you can do.
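
For what it's worth, MySQL does expose this through the information_schema database. A quick sketch for the whole-table count from metadata (note that TABLE_ROWS is exact for MyISAM but only an estimate for InnoDB; 'your_db' is a placeholder schema name):

-- Approximate row count from table metadata instead of scanning the table
SELECT TABLE_ROWS
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'your_db'   -- placeholder: your actual schema name
  AND TABLE_NAME = 'relations';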

Tim Büthe
+9  A: 

Indexing the columns in the GROUP BY clause would be the first thing to try, using a composite index. A query such as this can potentially be answered using only the index data, avoiding the need to scan the table at all. Since the records in the index are sorted, the DBMS should not need to perform a separate sort as part of the group processing. However, the index will slow down updates to the table, so be cautious with this if your table experiences heavy updates.
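
For example, a composite index on the filter column followed by the grouping column can let the query be answered from the index alone (the index name below is just illustrative):

-- Composite index covering both the WHERE and the GROUP BY columns
CREATE INDEX idx_rel_obj ON relations (relation_title, object_title);

-- The original query can then be satisfied from the index, which is
-- already sorted in GROUP BY order:
SELECT object_title, COUNT(*) AS hot_num
FROM relations
WHERE relation_title = 'XXXX'
GROUP BY object_title;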

If you use InnoDB for the table storage, the table's rows will be physically clustered by the primary key index. If that (or a leading portion of it) happens to match your GROUP BY key, that should speed up a query such as this because related records will be retrieved together. Again, this avoids having to perform a separate sort.
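
If you are free to change the primary key, a hedged sketch of what that clustering could look like, assuming the table has an AUTO_INCREMENT column named id (rebuilding the clustered index on a table this size is an expensive, blocking operation, so test it on a copy first):

-- Purely illustrative: cluster the rows by the filter/grouping key
ALTER TABLE relations
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (relation_title, object_title, id),
  ADD KEY (id);   -- InnoDB still needs an index that starts with the AUTO_INCREMENT column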

In general, bitmap indexes would be another effective alternative, but MySQL does not currently support these, as far as I know.

A materialized view would be another possible approach, but again this is not supported directly in MySQL. However, if you did not require the COUNT statistics to be completely up-to-date, you could periodically run a CREATE TABLE ... AS SELECT ... statement to manually cache the results. This is a bit ugly as it is not transparent, but may be acceptable in your case.
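
A minimal sketch of that periodic cache, with the cache table name invented; run it from a cron job or scheduled event as often as your staleness tolerance allows:

-- Rebuild the cached summary periodically
DROP TABLE IF EXISTS relations_hot_num_cache;
CREATE TABLE relations_hot_num_cache AS
SELECT relation_title, object_title, COUNT(*) AS hot_num
FROM relations
GROUP BY relation_title, object_title;
ALTER TABLE relations_hot_num_cache ADD INDEX (relation_title);

-- Reads then become a cheap indexed lookup
SELECT object_title, hot_num
FROM relations_hot_num_cache
WHERE relation_title = 'XXXX';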

You could also maintain a logical-level cache table using triggers. This table would have a column for each column in your GROUP BY clause, with a Count column for storing the number of rows for that particular grouping key value. Every time a row is added to or updated in the base table, insert or increment/decrement the counter row in the summary table for that particular grouping key. This may be better than the fake materialized view approach, as the cached summary will always be up-to-date, and each update is done incrementally and should have less of a resource impact. I think you would have to watch out for lock contention on the cache table, however.
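
A rough sketch of that trigger-maintained summary table (all names and column sizes are invented; you would also want matching AFTER DELETE and AFTER UPDATE triggers to decrement or move counts):

CREATE TABLE relation_counts (
  relation_title VARCHAR(255) NOT NULL,
  object_title   VARCHAR(255) NOT NULL,
  cnt            INT UNSIGNED NOT NULL DEFAULT 0,
  PRIMARY KEY (relation_title, object_title)
);

-- Keep the counter in step with every insert into the base table
CREATE TRIGGER relations_after_insert AFTER INSERT ON relations
FOR EACH ROW
  INSERT INTO relation_counts (relation_title, object_title, cnt)
  VALUES (NEW.relation_title, NEW.object_title, 1)
  ON DUPLICATE KEY UPDATE cnt = cnt + 1;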

cheduardo
Smaller columns may help: if the table scan is unavoidable, a smaller table will take less time to scan. Perhaps you could post the table structure and some sample data along with the exact query.
cheduardo
+5  A: 

If you have InnoDB, count(*) and any other aggregate function will do a table scan. I see a few solutions here:

  1. Use triggers and store aggregates in a separate table. Pros: integrity. Cons: slow updates
  2. Use processing queues. Pros: fast updates. Cons: old state can persist until the queue is processed so the user may feel a lack of integrity.
  3. Fully separate the storage access layer and store aggregates in a separate table. The storage layer will be aware of the data structure and can apply deltas instead of doing full counts. For example, if you provide an "addObject" function, you know when an object has been added and which aggregate is affected, so you only need to run an UPDATE ... SET count = count + 1 (see the sketch after this list). Pros: fast updates, integrity (you may want to use a lock, though, in case several clients can alter the same record). Cons: you couple a bit of business logic into the storage layer.
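
A minimal sketch of that incremental, storage-layer update against a hypothetical relation_counts summary table (the literal values are placeholders for the addObject/removeObject parameters):

-- Inside an "addObject" routine: create the counter row on first use,
-- increment it afterwards
INSERT INTO relation_counts (relation_title, object_title, cnt)
VALUES ('XXXX', 'some object', 1)
ON DUPLICATE KEY UPDATE cnt = cnt + 1;

-- Inside the corresponding "removeObject" routine
UPDATE relation_counts
SET cnt = cnt - 1
WHERE relation_title = 'XXXX' AND object_title = 'some object';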
Sorin Mocanu
+1, I've gotta try that concept... I have similar issues
Shahmir Javaid
+1  A: 

Test COUNT(myprimaryindexcolumn) and compare its performance to your COUNT(*).
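
For example, something like this, where id stands in for whatever your primary key column actually is:

SELECT object_title, COUNT(id) AS hot_num   -- id is a placeholder for your PK column
FROM relations
WHERE relation_title = 'XXXX'
GROUP BY object_title;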

Mark Schultheiss
+15  A: 

Here are several things I'd try, in order of increasing difficulty:

(easier) - Make sure you have the right covering index

CREATE INDEX ix_temp ON relations (relation_title, object_title);

This should maximize performance given your existing schema, since (unless your version of MySQL's optimizer is really dumb!) it will minimize the number of I/Os needed to satisfy your query (unlike the reverse column order, where the whole index would have to be scanned), and it will cover the query so you won't have to touch the clustered index.
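
One way to check that the index really is covering the query is to look at the plan; if it is, the Extra column of MySQL's EXPLAIN output should show "Using index":

EXPLAIN
SELECT object_title, COUNT(*) AS hot_num
FROM relations
WHERE relation_title = 'XXXX'
GROUP BY object_title;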

(a little harder) - Make sure your varchar fields are as small as possible

One of the perf challenges with varchar indexes on MySQL is that, when processing a query, the full declared size of the field will be pulled into RAM. So if you have a varchar(256) but are only using 4 chars, you're still paying the 256-byte RAM usage while the query is being processed. Ouch! So if you can shrink your varchar limits easily, this should speed up your queries.
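
For example, assuming the stored values turn out to be much shorter than the declared limits (check first; the lengths below are only placeholders):

-- See how long the stored values actually are
SELECT MAX(CHAR_LENGTH(relation_title)) AS max_rel,
       MAX(CHAR_LENGTH(object_title))   AS max_obj
FROM relations;

-- Then shrink the declared sizes (example lengths; keep your original
-- NULL/NOT NULL and character set options in the MODIFY clauses)
ALTER TABLE relations
  MODIFY relation_title VARCHAR(64),
  MODIFY object_title VARCHAR(100);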

(harder) - Normalize

30% of your rows having a single string value is a clear cry for normalizing into another table so you're not duplicating strings millions of times. Consider normalizing into three tables and using integer IDs to join them.

In some cases, you can normalize under the covers and hide the normalization with views which match the name of the current table... then you only need to make your INSERT/UPDATE/DELETE queries aware of the normalization but can leave your SELECTs alone.
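
A hedged sketch of what that could look like; every table, column, and view name here is invented:

-- Lookup tables for the repeated strings
CREATE TABLE relation_titles (
  id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  title VARCHAR(255) NOT NULL,
  UNIQUE KEY (title)
);

CREATE TABLE object_titles (
  id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  title VARCHAR(255) NOT NULL,
  UNIQUE KEY (title)
);

-- The big table now stores small integer IDs instead of strings
CREATE TABLE relations_normalized (
  relation_title_id INT UNSIGNED NOT NULL,
  object_title_id   INT UNSIGNED NOT NULL,
  KEY (relation_title_id, object_title_id)
);

-- Optional compatibility view so existing SELECTs keep working
CREATE VIEW relations_view AS
SELECT rt.title AS relation_title, ot.title AS object_title
FROM relations_normalized rn
JOIN relation_titles rt ON rt.id = rn.relation_title_id
JOIN object_titles ot ON ot.id = rn.object_title_id;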

(hardest) - Hash your string columns and index the hashes

If normalizing means changing too much code, but you can change your schema a little bit, you may want to consider creating 128-bit hashes for your string columns (using the MD5 function). In this case (unlike normalization) you don't have to change all your queries, only the INSERTs and some of the SELECTs. Anyway, you'll want to hash your string fields, and then create an index on the hashes, e.g.

CREATE INDEX ix_temp ON relations (relation_title_hash, object_title_hash);

Note that you'll need to play around with the SELECT to make sure you are doing the computation via the hash index and not pulling in the clustered index (required to resolve the actual text value of object_title in order to satisfy the query).

Also, if relation_title has a small varchar size but object_title is long, then you can potentially hash only object_title and create the index on (relation_title, object_title_hash).

Note that this solution only helps if one or both of these fields is very long relative to the size of the hashes.

Also note that there are interesting case-sensitivity/collation impacts from hashing, since the hash of a lowercase string is not the same as the hash of an uppercase one. So you'll need to canonicalize the strings before hashing them; in other words, hash only the lowercase form if you're in a case-insensitive DB. You may also want to trim spaces from the beginning or end, depending on how your DB handles leading/trailing spaces.
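
A sketch of the hashed-column variant, with the _hash column names invented (MD5 returns a 32-character hex string, hence CHAR(32); storing UNHEX(...) in a BINARY(16) column would be more compact):

-- Add the hash columns and populate them from canonicalized values
ALTER TABLE relations
  ADD COLUMN relation_title_hash CHAR(32),
  ADD COLUMN object_title_hash CHAR(32);

UPDATE relations
SET relation_title_hash = MD5(LOWER(TRIM(relation_title))),
    object_title_hash   = MD5(LOWER(TRIM(object_title)));

CREATE INDEX ix_temp ON relations (relation_title_hash, object_title_hash);

-- Filter and group on the hashes so the narrow hash index can be used;
-- resolving the readable object_title afterwards requires a lookup back
-- into the base table
SELECT object_title_hash, COUNT(*) AS hot_num
FROM relations
WHERE relation_title_hash = MD5(LOWER(TRIM('XXXX')))
GROUP BY object_title_hash;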

Justin Grant
The covering index Justin mentions here is absolutely the best way to get good performance out of this query.
BradC
+2  A: 

I see that a few people have asked which engine you are using for the query. I would highly recommend you use MyISAM for the following reasons:

InnoDB - @Sorin Mocanu properly identified that you will do a full table scan regardless of indexes.

MyISAM - always keeps the current row count handy.

Lastly, as @justin stated, make sure you have the proper covering index:

CREATE INDEX ix_temp ON relations (relation_title, object_title);
cballou
FYI, the huge speed benefit of MyISAM for COUNT(*) queries only applies when you're counting rows in the entire table. If there's a WHERE clause, then both MyISAM and InnoDB compute the count by counting rows in the index. See http://www.mysqlperformanceblog.com/2006/12/01/count-for-innodb-tables/ for more info.
Justin Grant
A: 

I would suggest archiving old data unless there is a specific reason to keep it all in the database, or you could partition the data and run queries against each partition separately.
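
If you go the partitioning route, a minimal sketch (this needs MySQL 5.1+ with partitioning support, and it assumes relations has no primary or unique key that excludes relation_title, since MySQL requires every unique key to include the partitioning columns; the partition count is arbitrary):

-- Hash-partition on the column used in the WHERE clause so a query for a
-- single relation_title only touches the one partition it hashes to
ALTER TABLE relations
PARTITION BY KEY (relation_title)
PARTITIONS 16;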

SoftwareGeek