I would like to know what the complexities of the following two operations are. The first case is a count where I order by an indexed column and ask for the count of all values above (or below) a certain number, like this:

SELECT count(*) FROM tbl WHERE col1 > 10 ORDER BY col1;

The other case concerns a median operation. By median I mean finding the value of row (int)(n/2), where n is the number of rows in the table. An example of this could be the following (again, there's an index on col1):

SELECT median(col1) FROM tbl ORDER BY col1;

What are the worst-case complexities of these queries?

+2  A: 

The ORDER BY clauses are unnecessary - or confusing, or both.

SELECT COUNT(*) will return a single row (normally). Because you have a criterion on the search, the optimizer may have to do an index scan on col1 (if there is an index with col1 as its leading column), or a table scan. That is an O(N) operation, where N is the number of rows in the table.
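For example, a quick way to see which access path the optimizer actually picks (this assumes MySQL and reuses the tbl/col1 names from the question):

-- 'type' shows 'range' for an index range scan or 'ALL' for a full table
-- scan, and 'rows' estimates how many entries the server expects to examine.
EXPLAIN SELECT COUNT(*) FROM tbl WHERE col1 > 10;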

SELECT MEDIAN(col1) will also return a single row (normally). It will be an O(N) operation, again using an index scan or a table scan.
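MySQL has no built-in MEDIAN(), so as a rough sketch of what the operation amounts to (assuming MySQL and the tbl/col1 names from the question), the middle row can be fetched through the index; the engine still steps through roughly half the index to reach it, which is why it remains O(N):

-- Fetch the (int)(n/2)-th row via the index on col1.
SELECT COUNT(*) INTO @n FROM tbl;
SET @mid = FLOOR(@n / 2);
PREPARE stmt FROM 'SELECT col1 FROM tbl ORDER BY col1 LIMIT 1 OFFSET ?';
EXECUTE stmt USING @mid;
DEALLOCATE PREPARE stmt;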

The 'normally' qualifier is there because I'm not absolutely sure what the optimizer will do with the ORDER BY clauses. One possibility is that the optimizer will determine that the clause is redundant and ignore it. The other possibility is that it will somehow add the col1 column from the ORDER BY to the projection columns, include it in the other operations, and then remove it before returning the results. However, that would run foul of mixing aggregates and non-aggregates without a GROUP BY clause - so I think the optimizer will either ignore it or reject the query. That said, I've not done the experiment with MySQL.

FWIW, IBM Informix Dynamic Server (IDS) yields error -19828: ORDER BY column or expression must be in SELECT list in this context.

Without the ORDER BY clauses, the analysis above is accurate enough. Note that for SELECT COUNT(*) with no criteria, the server can often use metadata it keeps about the table to answer the query in O(1) time.
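For instance, with no criterion at all the behaviour is engine-dependent (MyISAM keeps an exact row count in its table metadata; InnoDB does not and scans the smallest available index instead):

-- MyISAM can answer this from stored metadata in O(1); InnoDB cannot.
SELECT COUNT(*) FROM tbl;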

Jonathan Leffler
There is nothing in the SQL standard which mandates that count(*) is O(n). If the DBMS chooses to store the row count as per-table metadata, it's O(1). Even the WHERE clause won't necessarily make it O(n), since there are sub-linear ways to find the position of the first qualifying record (the first value above 10) and subtract that from the stored count.
paxdiablo
Concerning the median, my thought was that it might be possible to just take the middle node in the B-tree, which would take log n time to reach. It would be the middle value because the column is indexed, and thus ordered, by that number.
Per Stilling
Having commented, I just noticed the "worst case" in the question, so @JL is actually correct. The worst case is almost always a full table scan, O(n). +1.
paxdiablo
My thought on count was that maybe the B-trees usually used for indexes would keep, in each node, a count of the total number of records in its subtree. That should be maintainable in logarithmic or constant time. My goal is to figure out how it has been implemented, complexity-wise.
Per Stilling
Finding out is easy, @Per. Measure the time taken for 1, 10, 100, 1000, 10000 and 100000 records and see how they stack up against each other.
paxdiablo
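A rough sketch of such a measurement, assuming MySQL 8.0+ for the recursive CTE and reusing the table and column names from the question:

-- Build a test table of N rows (change 100000 to 1, 10, 100, ... and re-run),
-- then time the count query at each size (most clients print the elapsed time).
SET SESSION cte_max_recursion_depth = 100000;
CREATE TABLE tbl (col1 INT NOT NULL, KEY idx_col1 (col1));

INSERT INTO tbl (col1)
WITH RECURSIVE seq (n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM seq WHERE n < 100000
)
SELECT n FROM seq;

SELECT COUNT(*) FROM tbl WHERE col1 > 10;   -- time this at each table size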