views:

2453

answers:

14

I know it's generally a bad idea to do queries like this:

SELECT * FROM `group_relations`

But when I just want the count, should I go for this query since that allows the table to change but still yields the same results.

SELECT COUNT(*) FROM `group_relations`

Or the more specfic

SELECT COUNT(`group_id`) FROM `group_relations`

I have a feeling the latter could potentially be faster, but are there any other things to consider?

Update: I am using InnoDB in this case, sorry for not being more specific.

+16  A: 

If I remember it right, in MYSQL COUNT(*) counts all rows, whereas COUNT(column_name) counts only the rows that have a non-NULL value in the given column.

Sebastian Dietz
+68  A: 

If the column in question is NOT NULL, both of your queries are equivalent. When group_id contains null values,

select count(*)

will count all rows, whereas

select count(group_id)

will only count the rows where group_id is not null.

Also, some database systems, like MySQL employ an optimization when you ask for count(*) which makes such queries a bit faster than the specific one.

Personally, when just counting, I'm doing count(*) to be on the safe side with the nulls.

pilif
The way mysql handles count is very different depending on what database engine you're using. In this answer it assumes MyISAM, but the question states INNODB. On the former count is fast and cached, the later it's quite slow.
rabble
true, but a) the InnoDB note wasn't there when I've written the response and b) while InnoDB cannot optimize the count, count(*) isn't any faster or slower than count(group_id). And even on MyISAM, you'd lose the speed advantage the moment you add a WHERE clause
pilif
count(*) it is then!thanks!
grapefrukt
also if you want to know whether a row exists thenSELECT COUNT(`group_id`) FROM `group_relations` WHERE ROWNUM=1
Suresh S
+8  A: 

if you try SELECT COUNT(1) FROM group_relations it will be a bit faster because it will not try to retrieve information from your columns.

Edit: I just did some research and found out that this only happens in some db. In sqlserver it's the same to use 1 or *, but on oracle it's faster to use 1.

http://social.msdn.microsoft.com/forums/en-US/transactsql/thread/9367c580-087a-4fc1-bf88-91a51a4ee018/

Apparently there is no difference between them in mysql, like sqlserver the parser appears to change the query to select(1). Sorry if I mislead you in some way.

Sergio
Faster than COUNT(*)? Why does COUNT(*) have to retrieve information from the columns?
Kent Boogaart
If MySQL retrieves column info for count(*), that would be a seriously good reason for avoiding it as a DBMS. I can't believe the developers would have been that stupid.
paxdiablo
+1  A: 

It should depend on what you are actually trying to achieve as Sebastian has already said, i.e. make your intentions clear! If you are just counting the rows then go for the COUNT(*), or counting a single column go for the COUNT(column).

It might be worth checking out your DB vendor too. Back when I used to use Informix it had an optimisation for COUNT(*) which had a query plan execution cost of 1 compared to counting single or mutliple columns which would result in a higher figure

MrWiggles
+8  A: 

COUNT(*) count all rows while COUNT(column_name) will count only rows without NULL values in the specified column.

Important to note in MySQL:

COUNT() is very fast on MyISAM tables for * or not-null columns, since the row count is cached. InnoDB has no row count caching, so there is no difference in performance for COUNT(*) or COUNT(column_name), regardless if the column can be null or not. You can read more on the differences on this post at the MySQL performance blog.

Eran Galperin
maybe a dumb question... how do i know if my mysql db is backed by innodb or myisam? i run on a shared server, so i didn't set it up myself...
Kip
InnoDB and MyISAM are table storage engines, MySQL has several of those - http://dev.mysql.com/doc/refman/5.0/en/storage-engines.html. Your database can have a mix of storage engines for different tables. Have a look in your phpMyAdmin to check the storage engine of your different tables
Eran Galperin
+1  A: 

MySQL ISAM tables should have optimisation for COUNT(*), skipping full table scan.

dmajkic
+1  A: 

if you try SELECT COUNT(1) FROM group_relations it will be a bit faster because it will not try to retrieve information from your columns.

COUNT(1) used to be faster than COUNT(*), but that's not true anymore, since modern DBMS are smart enough to know that you don't wanna know about columns

Rafael Mueller
+2  A: 

An asterisk in COUNT has no bearing with asterisk for selecting all fields of table. It's pure rubbish to say that COUNT(*) is slower than COUNT(field)

I intuit that select COUNT(*) is faster than select COUNT(field). If the RDBMS detected that you specify "*" on COUNT instead of field, it doesn't need to evaluate anything to increment count. Whereas if you specify field on COUNT, the RDBMS will always evaluate if your field is null or not to count it.

But if your field is nullable, specify the field in COUNT.

Michael Buen
A: 

COUNT(*) facts and myths:

MYTH: "InnoDB doesn't handle count(*) queries well":

Most count(*) queries are executed same way by all storage engines if you have a WHERE clause, otherwise you InnoDB will have to perform a full table scan.

FACT: InnoDB doesn't optimize count(*) queries without the where clause

Charles Faiga
+5  A: 

I was curious about this myself. It's all fine to read documentation and theoretical answers, but I like to balance those with empirical evidence.

I have a MySQL table (InnoDB) that has 5,607,997 records in it. The table is in my own private sandbox, so I know the contents are static and nobody else is using the server. I think this effectively removes all outside affects on performance. I have a table with an auto_increment Primary Key field (Id) that I know will never be null that I will use for my where clause test (WHERE Id IS NOT NULL).

The only other possible glitch I see in running tests is the cache. The first time a query is run will always be slower than subsequent queries that use the same indexes. I'll refer to that below as the cache Seeding call. Just to mix it up a little I ran it with a where clause I know will always evaluate to true regardless of any data (TRUE = TRUE).

That said here are my results:

QueryType

      |  w/o WHERE          | where id is not null |  where true=true

COUNT()

      |  9 min 30.13 sec ++ | 6 min 16.68 sec ++   | 2 min 21.80 sec ++
      |  6 min 13.34 sec    | 1 min 36.02 sec      | 2 min 0.11 sec 
      |  6 min 10.06 se     | 1 min 33.47 sec      | 1 min 50.54 sec

COUNT(Id)

      |  5 min 59.87 sec    | 1 min 34.47 sec      | 2 min 3.96 sec 
      |  5 min 44.95 sec    | 1 min 13.09 sec      | 2 min 6.48 sec

COUNT(1)

      | 6 min 49.64 sec    | 2 min 0.80 sec       | 2 min 11.64 sec
      | 6 min 31.64 sec    | 1 min 41.19 sec      | 1 min 43.51 sec

++This is considered the cache Seeding call. It is expected to be slower than the rest.

I'd say the results speak for themselves. COUNT(Id) usually edges out the others. Adding a Where clause dramatically decreases the access time even if it's a clause you know will evaluate to true. The sweet spot appears to be COUNT(Id)... WHERE Id IS NOT NULL.

I would love to see other peoples' results, perhaps with smaller tables or with where clauses against different fields than the field you're counting. I'm sure there are other variations I haven't taken into account.

Chris
Interesting test. I tried to duplicate your results using a table of over 80M records on Mysql 5.1.45 using Innodb. My results were very different, count(*) with no Where clause was 20% faster than any other method. Adding a Where clause was always slower. My server could count(*) 80M rows in 15.2 seconds. I too had an auto-inc unsigned int as primary key.
Gary
+1  A: 

The advice I got from MySQL about things like this is that, in general, trying to optimize a query based on tricks like this can be a curse in the long run. There are examples over MySQL's history where somebody's high-performance technique that relies on how the optimizer works ends up being the bottleneck in the next release.

Write the query that answers the question you're asking -- if you want a count of all rows, use COUNT(*). If you want a count of non-null columns, use COUNT(col) WHERE col IS NOT NULL. Index appropriately, and leave the optimization to the optimizer. Trying to make your own query-level optimizations can sometimes make the built-in optimizer less effective.

That said, there are things you can do in a query to make it easier for the optimizer to speed it up, but I don't believe COUNT is one of them.

Edit: The statistics in the answer above are interesting, though. I'm not sure whether there is actually something at work in the optimizer in this case. I'm just talking about query-level optimizations in general.

Jon
A: 

It is best to count by an indexed column such as a primary key.

SELECT COUNT(`group_id`) FROM `group_relations`
Paul Janaway
A: 

I know it's generally a bad idea to do queries like this:

SELECT * FROM `group_relations`

But when I just want the count, should I go for this query since that allows the table to change but still yields the same results.

SELECT COUNT(*) FROM `group_relations`

As your question implies, the reason SELECT * is ill-advised is that changes to the table could require changes in your code. That doesn't apply to COUNT(*). It's pretty rare to want the specialized behavior that SELECT COUNT('group_id') gives you - typically you want to know the number of records. That's what COUNT(*) is for, so use it.

Carl Manaster
+2  A: 

Seek Alternatives

As you've seen, when tables grow large, COUNT queries get slow. I think the most important thing is to consider the nature of the problem you're trying to solve. For example, many developers use COUNT queries when generating pagination for large sets of records in order to determine the total number of pages in the result set.

Knowing that COUNT queries will grow slow, you could consider an alternative way to display pagination controls that simply allows you to side-step the slow query. Google's pagination is an excellent example.

Denormalize

If you absolutely must know the number of records matching a specific count, consider the classic technique of data denormalization. Instead of counting the number of rows at lookup time, consider incrementing a counter on record insertion, and decrementing that counter on record deletion.

If you decide to do this, consider using idempotent, transactional operations to keep those denormalized values in synch.

BEGIN TRANSACTION;
INSERT INTO  `group_relations` (`group_id`) VALUES (1);
UPDATE `group_relations_count` SET `count` = `count` + 1;
COMMIT;

Alternatively, you could use database triggers if your RDBMS supports them.

Depending on your architecture, it might make sense to use a caching layer like memcached to store, increment and decrement the denormalized value, and simply fall through to the slow COUNT query when the cache key is missing. This can reduce overall write-contention if you have very volatile data, though in cases like this, you'll want to consider solutions to the dog-pile effect.

Duncan Beevers