views:

124

answers:

5

Assuming indexes are put in place, and absolute-count-accuracy is not necessary (it's okay to be off by one or two), is it okay to use:

Option A

select count(*) 
  from Table 
 where Property = @Property

vs

Option B

update PropertyCounters
   SET PropertyCount = PropertyCount + 1  
 where Property = @Property

then doing:

select PropertyCount 
  from PropertyCounters 
 where Property = @Property

How much performance degradation can I reasonably expect from doing select count(*) as the table grows into thousands/millions of records?

+2  A: 

Keeping a separate count column in addition to the real data is a denormalisation. There are reasons why you might need to do it for performance, but you shouldn't go there until you really need to. It makes your code more complicated, with more chance of inconsistencies creeping in.

For the simple case where the query really is just SELECT COUNT(property) FROM table WHERE property=..., there's no reason to denormalise; you can make that fast by adding an index on the property column.

bobince
Thanks, Option A it is...index is in place. Let's see how it handles :-)
Lou
+2  A: 

You didn't specify the platform, but since you use T-SQL syntax for @variables I'll venture a SQL Server platform specific answer:

count(*), or strictly speaking would be count_big(*), is an expression that can be used in indexed views, see Designing Indexed Views.

create view vwCounts
with schembinding
as select Property, count_big(*) as Count
from dbo.Table
group by Property;

create unique clustered index cdxCounts on vwCounts(Property);

select Count 
from vwCount with (noexpand)
where Property = @property;

On Enterprise Edition the optimizer will even use the indexed view for your original query:

select count_big(*)
from Table
where Property = @property;

So in the end you get your cake and eat it too: the property is already aggregated and maintained for your for free by the engine. The price is that updates have to maintain the indexed view (they will not recompute the aggregate count though) and the aggregation will create hot spots for contention (locks on separate rows on Table will contend for same count(*) update on the indexed view).

Remus Rusanu
+1: But MySQL uses @ in variables too
OMG Ponies
I actually did not know that...
Remus Rusanu
Neither did I, but you were right, I was referring to T-SQL/SQL Server...thanks for your suggestion.
Lou
A: 

This isn't something that can be answered in general SQL terms. Quite apart from the normal caveats about indices and so on affecting queries, it's also something where there is considerable different between platforms.

I'd bet on better performance on this from SQL Server than Postgres, to the point where I'd consider the latter approach sooner on Postgres and not on SQL Server. However, with a partial index set just right for matching the criteria, I'd bet on Postgres beating out SQL Server. That's just what I'd bet small winnings on though, either way I'd test if I needed to think about it for real.

If you do go for the latter approach, enforce it with a trigger or similar, so that you can't become inaccurate.

Jon Hanna
A: 

If you say that you do not need absolute accuracy, then Option B is a strange approach. If Option A becomes too heavy (even after adding indexes), you can cache the output of Option A in memory or in another table (your PropertyCounters), and periodically refresh it.

Thilo
A: 

On SQL Server, if you don't need absolutely accurate counts, you could also inspect the catalog views. This would be much easier to do - you don't have to keep a count yourself - and it's a lot less taxing on the system. After all, if you need to count all the rows in a table, you need to scan that table, one way or another - no way around that.

With this SQL statement here, you'll get all the tables in your database, and their row counts, as kept by SQL Server:

SELECT 
    t.NAME AS TableName,
    SUM(p.rows) AS RowCounts
FROM 
    sys.tables t
INNER JOIN      
    sys.indexes i ON t.OBJECT_ID = i.object_id
INNER JOIN 
    sys.partitions p ON i.object_id = p.OBJECT_ID AND i.index_id = p.index_id
WHERE 
    t.NAME NOT LIKE 'dt%' AND
    i.OBJECT_ID > 255 AND   
    i.index_id <= 1
GROUP BY 
    t.NAME, i.object_id, i.index_id, i.name 
ORDER BY 
    OBJECT_NAME(i.object_id) 

I couldn't find any documentation on exactly how current those numbers are, typically - but from my own experience, they're usually on the spot (unless you're doing some bulk loading or something - but in that case, you wouldn't want to constantly scan the table to get the exact count, either)

marc_s
that's a very daunting approach. I'll have to look into that, thanks.
Lou