Usually it's a Stream Aggregate
or a Hash Aggregate
.
Stream aggregate
sorts the resultset, scans it and returns every new value (not equal to the last in scan). It allows to keep but one set of the aggregate state variables.Hash aggregate
builds a hash table from the resultset. Each entry keeps the aggregate state variables which are initialized on hash miss and updated on hash hit.
Let's see how AVG
works. It needs two state variables: sum
and count
grouper value
1 4
1 3
2 8
1 7
2 1
1 2
2 6
2 3
Stream Aggregate
First, it needs to sort the values:
grouper value 1 4 1 3 1 7 1 2 2 8 2 1 2 6 2 3
Then, it keeps one set of state variables, initialized to
0
, and scans the sorted resultset:grouper value sum count -- Entered -- Variables: 0 0 1 4 4 1 1 3 7 2 1 7 14 3 1 2 16 4 -- Group change. Return the result and reinitialize the variables -- Returning 1, 4 -- Variables: 0 0 2 8 8 1 2 1 9 2 2 6 15 3 2 3 18 4 -- Group change. Return the result and reinitialize the variables -- Returning 2, 4.5 -- End
Hash aggregate
Just scanning the values and keeping the state variables in the hash table:
grouper value -- Hash miss. Adding new entry to the hash table -- [1] (0, 0) -- ... and updating it: 1 4 [1] (4, 1) -- Hash hit. Updating the entry: 1 3 [1] (7, 2) -- Hash miss. Adding new entry to the hash table -- [1] (7, 2) [2] (0, 0) -- ... and updating it: 2 8 [1] (7, 2) [2] (8, 1) 1 7 [1] (14, 3) [2] (8, 1) 2 1 [1] (14, 3) [2] (9, 2) 1 2 [1] (16, 4) [2] (9, 2) 2 6 [1] (16, 4) [2] (15, 3) 2 3 [1] (16, 4) [2] (18, 4) -- Scanning the hash table and returning the aggregated values -- 1 4 -- 2 4.5
Usually, sort is faster if the resultset is already ordered (like, the values come out of the index or a resultset sorted by the previous operation).
Hash is faster is the resultset is not sorted (hashing is faster than sorting).
MIN
and MAX
are special cases, since they don't require scanning the whole group: only the first and the last value of the aggregated column within the group.
Unfortunately, SQL Server
, unlike most other systems, cannot utilize this efficiently, since it's not good in doing INDEX SKIP SCAN
(jumping over distinct index keys).
While simple MAX
and MIN
(without GROUP BY
clause) use a TOP
method if the index on the aggregated column is present, MIN
and MAX
with GROUP BY
use same methods as other aggregate functions do.