views: 3719
answers: 11

I understand the best way to count the number of rows in an SQL table is count(*) (or equivalently count(PrimaryKey)).

  1. Is this O(1)?
  2. If not, why not?

Why not just implement a counter and return it for this specific query? Is it because this query is not a common use case?

If the answers vary according to SQL engine, I'd like to hear the differences - but in any case, I'm interested in the actual implementation in production SQL engines.

+1  A: 

For Oracle it is generally O(N) unless the result is already cached, as it essentially has to iterate over all of the table's blocks, or walk an index, to count the rows.
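
You can see which path the optimizer takes by inspecting the plan (a sketch; the table name is hypothetical):

explain plan for select count(*) from some_table;
select * from table(dbms_xplan.display);

-- The plan will show either a TABLE ACCESS FULL of SOME_TABLE or an
-- INDEX (FAST) FULL SCAN feeding a SORT AGGREGATE step.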

EvilTeach
A: 

A database could store the number of rows in a table and respond in O(1) to SELECT COUNT(*) FROM MyTable.

But, really, what good would that do them? Any variation on it (say, SELECT COUNT(*) FROM MyTable WHERE Category = 5) would require a full table scan (or index scan) and would be O(N).
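
An index over the filter column can at least narrow that scan (a hypothetical sketch):

CREATE INDEX IX_MyTable_Category ON MyTable (Category)

-- SELECT COUNT(*) FROM MyTable WHERE Category = 5 can now be answered from the
-- index entries for Category = 5 rather than a full table scan -- still
-- O(number of matching rows), not O(1).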

James Curran
I know it COULD; I'm asking whether they do, and if not (as I understand they don't), why not.
ripper234
No, they should NOT store the number of rows; see my answer.
Cd-MaN
Ahem... The "however..." line was the important part of the answer.
James Curran
+8  A: 

No, this is not a common use case. Most row counts I have seen have some WHERE clause involved.

The main reason this is not implemented, though, is that the row counter would be a source of contention in a multi-user environment. Every time a row was inserted or deleted the counter would need updating, effectively locking the whole table for each insert/delete.

James Anderson
+8  A: 

In some RDBMSs this is O(1) (most notably MySQL), but AFAIK it is generally frowned upon and considered an "ugly performance hack". The reason is that if you have transactions (which every real RDBMS should have), the total number of rows in the table might or might not be equal to the number you can see from your current transaction. The server needs to check which rows are actually visible to your transaction, which makes it O(N) rather than O(1).
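
A sketch of why a shared counter cannot work once transactions are involved (InnoDB at REPEATABLE READ; table and column names are hypothetical):

-- Session 1
START TRANSACTION;
SELECT COUNT(*) FROM t;      -- say this returns 100

-- Session 2
INSERT INTO t (id) VALUES (101);
COMMIT;

-- Session 1, still inside the same transaction
SELECT COUNT(*) FROM t;      -- still 100: the answer depends on this
                             -- transaction's snapshot, so no single global
                             -- counter could be correct for every session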

If you want to optimize the process of getting the number of rows and are satisfied with an approximate result, most RDBMSs have special "info" tables which hold information about the tables, including the approximate number of rows (again, not the exact number, because of transactions).
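
For example, in MySQL (a sketch; schema and table names are hypothetical, and the value is an estimate for InnoDB):

SELECT TABLE_ROWS
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'mydb'
  AND TABLE_NAME = 'mytable';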

Cd-MaN
Note that in MySQL, too, it is not O(1). However, if you have an auto-incremented ID field (and no deleted rows leaving gaps), you can use a little trick and do "select id from table order by id desc limit 1".
ripper234
MySQL is O(1) if MyISAM is used - from the manual http://dev.mysql.com/doc/refman/5.1/en/group-by-functions.html#function_count : "COUNT(*) is optimized to return very quickly if the SELECT retrieves from one table, no other columns are retrieved, and there is no WHERE clause." ... "This optimization applies only to MyISAM tables, because an exact row count is stored for this storage engine and can be accessed very quickly."
Cd-MaN
A: 

There is an (inaccurate) shortcut in SQL Server: you can look at the row count stored in the sys.partitions metadata for a particular object, such as an index on a table.

The operation is O(1), but the result is only an estimate.

Cade Roux
+1  A: 

It's normally O(N).

If an O(1) response to such a query is needed, you can accomplish it using either:

  • An indexed view that maintains the count (see the sketch after the trigger example below).
  • A count stored manually in another table and kept up to date by a trigger:

Example:

CREATE TABLE CountingTable ( Count int )

INSERT CountingTable VALUES (0)

-- Keep the stored count in sync on every data modification
CREATE TRIGGER Counter ON [Table] FOR INSERT, UPDATE, DELETE AS
BEGIN
   DECLARE @added int, @removed int
   SET @added = (SELECT COUNT(*) FROM inserted)
   SET @removed = (SELECT COUNT(*) FROM deleted)
   UPDATE CountingTable SET Count = Count + @added - @removed
END
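
A sketch of the indexed-view option mentioned above (SQL Server; table and column names are hypothetical -- an indexed view over an aggregate needs SCHEMABINDING, COUNT_BIG(*) and a GROUP BY, so the per-group counts are summed for the total):

CREATE VIEW dbo.MyTableCounts WITH SCHEMABINDING AS
SELECT Category, COUNT_BIG(*) AS RowCnt
FROM dbo.MyTable
GROUP BY Category
GO

CREATE UNIQUE CLUSTERED INDEX IX_MyTableCounts ON dbo.MyTableCounts (Category)
GO

-- The engine maintains the per-group counts; the total is a scan of a tiny index
SELECT SUM(RowCnt) FROM dbo.MyTableCounts WITH (NOEXPAND)
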
Mehrdad Afshari
The trigger is a serialising process though, and would either lead to contention or to an inaccurate count
David Aldridge
It would reduce performance considerably with many concurrent transactions, I think. But a wrong count? I don't think so. It will at least return the count as seen by your transaction, if your isolation level is >= READ COMMITTED
Mehrdad Afshari
Hmmm, how about if you truncate the table?
David Aldridge
When you come up with such an ugly hack, you shouldn't truncate the table. You can even modify the count directly :). Anyway, I didn't endorse it, but with normal use it shouldn't generate wrong results. If your isolation level is READ UNCOMMITTED, you may end up with a race condition and a wrong count.
Mehrdad Afshari
+3  A: 

The performance of a COUNT(*) based on an index or on a table really depends on the segment size. You can have a 1GB table that only has a single row in it, but Oracle might still have to scan the entire allocated space. Inserting another million rows might not affect performance at all if it does not alter the high-water mark. Indexes work in a similar manner, where different patterns of deletes can leave different amounts of free space in the index structure and cause index scans to give better or worse performance than O(N).

So, theoretically it's O(N). In practice there are implementation issues that can make it behave very differently.

For example there are some cases where an Oracle data warehouse might give better than O(N) performance. In particular the optimizer could scan a bitmap index, and the size of a bitmap index is only weakly related to the size of the table, unlike a b-tree index. This is because of the compression methodology, which makes the index size dependent on the size of the table, the number of unique values, the distribution of values throughout the table, and, I believe, the historical loading pattern as well. So, doubling the number of rows in the table might only increase the size of the index by 10%.
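
For instance (a sketch; bitmap indexes need Enterprise Edition, and the names are hypothetical):

-- A bitmap index on a low-cardinality column; bitmap indexes also index NULLs,
-- so the optimizer can answer COUNT(*) from the much smaller index alone
CREATE BITMAP INDEX sales_channel_bix ON sales (channel_id);

-- A plan for SELECT COUNT(*) FROM sales would then typically show
-- SORT AGGREGATE over BITMAP CONVERSION COUNT over a fast full scan of the index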

In the presence of a materialized view you might also get O(1) by reading a summary table (a trigger is an unsafe way of doing this).
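
A rough sketch of that approach (fast-refresh prerequisites vary by Oracle version; names are hypothetical):

-- Materialized view log so the summary can be fast-refreshed on commit
CREATE MATERIALIZED VIEW LOG ON mytable WITH ROWID INCLUDING NEW VALUES;

CREATE MATERIALIZED VIEW mytable_count_mv
  REFRESH FAST ON COMMIT
AS
  SELECT COUNT(*) AS row_count FROM mytable;

-- The total is then a single-row read
SELECT row_count FROM mytable_count_mv;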

David Aldridge
A: 

With Informix, in the absence of complicating factors such as LBAC (Label-Based Access Control), SELECT COUNT(*) FROM SomeTable is O(1); it pulls the answer from the control information the server maintains. If there's a WHERE clause, or LBAC is in play, or the table is a view, or any of a number of other factors apply, then it ceases to be O(1).
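
If an estimate is enough even when those factors apply, the system catalog also carries a row count that is refreshed by UPDATE STATISTICS (a sketch; the table name is hypothetical):

SELECT nrows
FROM systables
WHERE tabname = 'sometable';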

Jonathan Leffler
A: 

Apparently O(N) on PostgreSQL:

=> explain select count(*) from tests;
                         QUERY PLAN                              
---------------------------------------------------------------------
Aggregate  (cost=37457.88..37457.89 rows=1 width=0)
  ->  Seq Scan on tests  (cost=0.00..33598.30 rows=1543830 width=0)
(2 rows)

(Seq Scan means it has to scan the whole table)
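
If an approximation is acceptable, the planner's statistics can be read directly (a sketch; the value is only as fresh as the last VACUUM/ANALYZE):

=> select reltuples::bigint AS approx_rows
   from pg_class
   where relname = 'tests';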

bortzmeyer
+2  A: 

It's not constant time, because in transactional engines it needs to check how many rows exist in the current transaction, which generally involves a whole table scan.

Optimising COUNT(*) with no WHERE clause is not a particularly useful optimisation for a database to do at the expense of other things; users of large tables rarely do such a query, and it would not help at all if there were a WHERE clause present.

MyISAM in MySQL does "cheat" by storing the exact row count, but it can only do this because it doesn't have MVCC and therefore doesn't need to worry about which transactions the rows belong to.

MarkR
+3  A: 

In MS SQL Server, doing a COUNT(*) on a table always does an index scan (on the primary key) or a table scan (both bad). For large tables, this can take a while.

Instead, there is a cool trick to show the current number of records nearly instantly (the same one Microsoft uses when you right-click a table and select Properties):

--SQL 2005 or 2008
select sum (spart.rows)
from sys.partitions spart
where spart.object_id = object_id('YourTable')
and spart.index_id < 2

--SQL 2000
select max(rows) from sysindexes
where id = object_id('YourTable')

This number may be slightly off depending on how often SQL Server updates the index statistics, but if you need a ballpark rather than an exact number, these work great.

BradC