views:

147

answers:

4

What is the time complexity of a function such as COUNT, SUM, AVG or any of the other built-in "math" functions in MySQL, SQL Server, Oracle and others?

One would think that calling SUM(myColumn) would be linear?

But COUNT(1) isn't. How come, and what are the real time complexities?

In a perfect world I would want SUM, AVG and COUNT to be O(1), but we don't live in one of those, do we?

+1  A: 

Note: this is speculation based on my understanding of how SQL query planners work, and may not be entirely accurate.

I believe all the aggregate functions, or at least the "math" ones you name above, should be O(n). The query will be executed roughly as follows:

  1. Fetch rows matching the join predicates and filter predicates (i.e. the WHERE clause)
  2. Create row-groups according to the GROUP BY clause. A single row-group is created for queries with no GROUP BY
  3. For each row-group, apply the aggregate function to the rows in the group. For SUM, AVG, MIN and MAX, as well as non-numeric aggregates like GROUP_CONCAT, there are simple O(n) algorithms, and I suspect those are used. Create one row in the output set for each row-group created in step #2
  4. If a HAVING predicate is present, filter output rows using this predicate

Note, however, that even though the aggregate functions are O(n), the operation as a whole might not be. If you write a query which Cartesian-joins a table to itself, you will be looking at O(n*n) at minimum just to create the initial row set (step #1). Sorting to create row-groups (step #2) may be O(n lg n), and may require disk storage for the sort operation (as opposed to in-memory operation only), so your query may still perform poorly if you are manipulating many rows.
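The four steps above can be seen in a single query. A minimal sketch using SQLite from Python (the table and data are made up for illustration): WHERE performs step #1, GROUP BY step #2, SUM step #3, and HAVING step #4.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("alice", 10.0), ("bob", 5.0), ("alice", 7.5)])

# Steps 1-4 in one query: WHERE filters rows, GROUP BY builds row-groups,
# SUM aggregates each group, HAVING filters the output rows.
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "WHERE amount > 1 GROUP BY customer HAVING SUM(amount) > 6"
).fetchall()
print(rows)
```

Here "bob" survives the WHERE filter but his group total (5.0) is dropped by HAVING, so only one output row remains.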

dcrosta
+3  A: 

In SQL, the mathematical complexity of the aggregate functions is totally irrelevant. The only thing that really matters is the data access complexity: which access path is chosen (table scan, index range scan, index seek etc.) and how many pages are read. There may be slight differences in the internals of each aggregate, but they all work pretty much the same way (keep a running state and update the running aggregation for each input value), and there is absolutely NO aggregate that looks at its input twice. So they are all O(n) as internal implementations, where 'n' is the number of records fed to the aggregate (not necessarily the number of records in the table!).

Some aggregates have internal shortcuts, e.g. COUNT(*) may return the count from metadata on some systems, if possible.
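The "running state" idea can be sketched in a few lines. This is an illustrative model of how an engine might compute AVG, not any database's actual code:

```python
def avg_single_pass(values):
    """Model of an engine computing AVG: one pass over the input,
    constant-size running state (a total and a count)."""
    total, count = 0.0, 0
    for v in values:   # each input value is seen exactly once: O(n)
        total += v
        count += 1
    return total / count if count else None  # SQL's AVG of no rows is NULL

print(avg_single_pass([2, 4, 6]))  # 4.0
```

Note that the state never grows with the input, which is why the cost is dominated by fetching the n records, not by the arithmetic.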

Remus Rusanu
A: 

For big data-warehouse-style queries, the major databases can parallelize the task, so multiple CPUs work on it. There will therefore be threshold points where the cost isn't quite linear, as the cost of coordinating the parallel threads trades off against the benefit of using the multiple CPUs.
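The pattern behind this is partial aggregation: split the input, aggregate each chunk independently, then combine the partial results. A sketch (real engines parallelize at the storage and CPU level; Python threads will not actually speed up this arithmetic, they only illustrate the shape):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(values, workers=4):
    """Partial-aggregate pattern: each worker sums one chunk,
    then the partial sums are combined at the end."""
    chunk = max(1, len(values) // workers)
    chunks = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum, chunks))
    # The combine step is the coordination cost the answer mentions:
    # cheap for SUM, but it only pays off once the chunks are large enough.
    return sum(partials)

print(parallel_sum(list(range(101))))  # 5050
```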

Gary
+1  A: 

What is the time complexity of a function such as COUNT, SUM, AVG or any of the other built-in "math" functions in MySQL, SQL Server, Oracle and others?

  • In MySQL with MyISAM, COUNT(*) without GROUP BY is O(1) (constant)

    It is stored in the table metadata.

  • In all systems, MAX and MIN on indexed expressions without GROUP BY are O(log(n)) (logarithmic).

    They are fetched with a single index seek.

  • Aggregate functions are O(n) (linear) when used without GROUP BY, or when GROUP BY uses HASH

  • Aggregate functions are O(n log(n)) when GROUP BY uses SORT.

All values need to be fetched, calculated and stored in state variables (which may be held in a hash table).

In addition, when using SORT, they should also be sorted.
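The index-seek behaviour for MIN/MAX can be checked directly. A SQLite sketch (the plan text varies by SQLite version, so only the result is asserted; with an index on `x`, the plan shows an index lookup rather than a full table scan):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.execute("CREATE INDEX ix_x ON t(x)")
conn.executemany("INSERT INTO t VALUES (?)", [(v,) for v in (42, 7, 99)])

# With an index on x, the engine can seek to one end of the index
# (an O(log n) operation) instead of reading every row.
plan = conn.execute("EXPLAIN QUERY PLAN SELECT MIN(x) FROM t").fetchall()
print(plan)  # plan wording depends on the SQLite version

lowest = conn.execute("SELECT MIN(x) FROM t").fetchone()[0]
print(lowest)
```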

Quassnoi