views:

171

answers:

7

The complexity of methods in most programming languages can be measured in cyclomatic complexity with static source code analyzers. Is there a similar metric for measuring the complexity of a SQL query?

It is simple enough to measure the time it takes a query to return, but what if I just want to be able to quantify how complicated a query is?

[Edit/Note] While getting the execution plan is useful, that is not necessarily what I am trying to identify in this case. I am not looking for how difficult it is for the server to execute the query; I am looking for a metric that identifies how difficult it was for the developer to write the query, and how likely it is to contain a defect.

[Edit/Note 2] Admittedly, there are times when measuring complexity is not useful, but there are also times when it is. For a further discussion on that topic, see this question.

+2  A: 

SQL queries are declarative rather than procedural: they don't specify how to accomplish their goal. The SQL engine will create a procedural plan of attack, and that might be a good place to look for complexity. Try examining the output of the EXPLAIN (or EXPLAIN PLAN) statement; it will be a crude description of the steps the engine will use to execute your query.
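
For example, here is a minimal sketch using SQLite's variant of this (EXPLAIN QUERY PLAN) from Python's built-in sqlite3 module; the table and query are made up purely for illustration, and other engines expose the plan through their own EXPLAIN / EXPLAIN PLAN syntax and output format:

import sqlite3

# Throwaway in-memory database and table, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, name TEXT)")

# EXPLAIN QUERY PLAN returns one row per step the engine plans to take.
query = "SELECT name FROM foo WHERE id = 42"
for row in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(row)   # the last column is a human-readable description of the step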

Ned Batchelder
"SQL queries are declarative rather than procedural" -- which is why you can't consider the SQL DML in isolation from the SQL DDL.
onedaywhen
A: 

Well, if you are using SQL Server I would say that you should look at the cost of the query in the execution plan (specifically the subtree cost).

Here is a link that goes over some of the things you should look at in the execution plan.

Abe Miessler
A: 

Depending on your RDBMS, there might be query plan tools that can help you analyze the steps the RDBMS will take in fetching your query.

SQL Server Management Studio Express has a built-in query execution plan viewer. Pervasive PSQL has its Query Plan Finder. DB2 has similar tools (I forget what they're called).

Duracell
A: 

A good question. The problem is that for a SQL query like:

SELECT * FROM foo;

the complexity may depend on what "foo" is and on the database implementation. For a function like:

int f( int n ) {
   if ( n == 42 ) {
      return 0;
   }
   else {
      return n;
   }
}

there is no such dependency.

However, I think it should be possible to come up with some useful metrics for a SELECT, even if they are not very exact, and I'll be interested to see what answers this gets.

anon
I somewhat disagree about the `foo` example. That would be like factoring in the complexity of called functions when measuring the complexity of procedural code.
pascal
+2  A: 

I'm not sure retrieving the query plans will answer the question: the query plans hide part of the complexity of the computation performed on the data before it is returned (or used in a filter), and they require a sizeable database to be relevant. In fact, complexity and length of execution are somewhat opposed; something like "Good, Fast, Cheap - Pick any two".

Ultimately, isn't it about the chances of making a mistake, or of not understanding the code I've written?

Something like:

  • the number of tables, times the sum of:
      • 1
      • +1 per join expression (+1 per outer join?)
      • +1 per predicate after WHERE or HAVING
      • +1 per GROUP BY expression
      • +1 per UNION or INTERSECT
      • +1 per function call
      • +1 per CASE expression
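
A minimal sketch of what a scorer along these lines might look like in Python; the keyword patterns and weights below are illustrative guesses, not part of the original suggestion and not a standard metric:

import re

# Very rough complexity score in the spirit of the list above.
def sql_complexity(query):
    q = re.sub(r"\s+", " ", query.upper())
    tables = max(1, len(re.findall(r"\bFROM\b|\bJOIN\b", q)))               # crude table count
    score = 1
    score += len(re.findall(r"\bJOIN\b", q))                                # +1 per join
    score += len(re.findall(r"\b(LEFT|RIGHT|FULL)\s+(OUTER\s+)?JOIN\b", q)) # extra for outer joins
    score += len(re.findall(r"\bAND\b|\bOR\b", q))                          # extra predicates
    score += len(re.findall(r"\bGROUP BY\b", q))
    score += len(re.findall(r"\bUNION\b|\bINTERSECT\b", q))
    score += len(re.findall(r"\bCASE\b", q))
    score += len(re.findall(r"\b(SUM|COUNT|AVG|MIN|MAX|COALESCE)\s*\(", q)) # function calls
    return tables * score

print(sql_complexity("SELECT * FROM foo"))            # small number
print(sql_complexity(
    "SELECT a.x, SUM(b.y) FROM a LEFT OUTER JOIN b ON a.id = b.a_id "
    "WHERE a.x > 0 AND b.y IS NOT NULL GROUP BY a.x"))  # larger number

The absolute values mean nothing; only the relative ordering between queries is useful.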
pascal
This is exactly the kind of thing I am looking for. If I can't find one, I might brew my own, similar to this.
epotter
You could also remove some points (half a point?) for doing a search on an indexed field. And don't forget your ORDER BYs, too.
MPelletier
As someone mentioned, this measure would not be about the efficiency of the SQL statements. It's about their complexity, or the risk they present to the tests (e.g. missing a predicate, or using an inner instead of a left join, or the infamous *why is my simple query taking forever to execute?*, aka the missing join). In that sense, I don't see why the presence of an index should be taken into account.
pascal
+1  A: 

Well, I don't know of any tool that does such a thing, but it seems to me that what would make a query more complicated could be measured by:

  • the number of joins
  • the number of where conditions
  • the number of functions
  • the number of subqueries
  • the number of casts to different datatypes
  • the number of case statements
  • the number of loops or cursors
  • the number of steps in a transaction

However, while it is true that the more complex queries might appear to be the ones with the most possible defects, I find that the simple ones are also very likely to contain defects, as they are more likely to be written by someone who doesn't understand the data model, and thus they may appear to work correctly but in fact return the wrong data. So I'm not sure such a metric would tell you much.

HLGEM
Like any static code analysis, the usefulness would be limited. So I agree with what you are saying. But let's consider a situation where a single developer, or three equally skilled developers, wrote 20 queries. If it were possible to determine which queries were the most complex and hence most likely to contain defects, testing could focus first and/or most on those queries. Static code analyzers are never indicators of correctness; they are just indicators. They give you something else to sniff for 'code smells'.
epotter
+1  A: 

Common measures of software complexity include Cyclomatic Complexity (a measure of how complicated the control flow is) and Halstead complexity (a measure of how complex the arithmetic is).

The "control flow" in a SQL query is best related to "and" and "or" operators in query.

The "computational complexity" is best related to operators such as SUM or implicit JOINS.

Once you've decided how to categorize each unit of syntax of a SQL query as to whether it is "control flow" or "computation", you can straightforwardly compute Cyclomatic or Halstead measures.
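
A minimal sketch of how that categorization might be turned into numbers, in Python; the choice of which keywords count as "control flow" versus "computation" is my own rough reading of this answer, not an established definition, and the Halstead part only computes the raw vocabulary and length counts:

import re

# Rough Cyclomatic-style count: one path, plus one per decision point.
# Treating AND/OR (and CASE ... WHEN) as the query's "control flow" is an
# assumption for illustration only.
def sql_cyclomatic(query):
    q = query.upper()
    return 1 + len(re.findall(r"\bAND\b|\bOR\b|\bWHEN\b", q))

# Rough Halstead-style counts: distinct and total operators/operands.
def sql_halstead_counts(query):
    q = query.upper()
    operators = re.findall(r"\bJOIN\b|\bSUM\b|\bCOUNT\b|\bAVG\b|[=<>+\-*/]", q)
    operands = re.findall(r"\b[A-Z_][A-Z0-9_]*\b|\b\d+\b", q)   # very crude: keywords leak in
    n1, n2 = len(set(operators)), len(set(operands))            # distinct operators/operands
    N1, N2 = len(operators), len(operands)                      # total occurrences
    return {"vocabulary": n1 + n2, "length": N1 + N2}

q = "SELECT SUM(y) FROM t WHERE a = 1 AND (b > 2 OR c < 3)"
print(sql_cyclomatic(q))         # 3: one path plus the AND and the OR
print(sql_halstead_counts(q))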

What the SQL optimizer does to queries is, I think, absolutely irrelevant. The purpose of complexity measures is to characterize how hard it is for a person to understand the query, not how efficiently it can be evaluated.

Similarly, what the DDL says, or whether views are involved or not, shouldn't be included in such complexity measures. The assumption behind these metrics is that the complexity of the machinery inside a used abstraction isn't interesting when you simply invoke it, because presumably that abstraction does something well understood by the coder. This is why Halstead and Cyclomatic measures don't include called subroutines in their counting, and I think you can make a good case that views and DDL information are those "invoked" abstractions.

Finally, how perfectly right or how perfectly wrong these complexity numbers are doesn't matter much, as long as they reflect some truth about complexity and you can compare them relative to one another. That way you can choose which SQL fragments are the most complex, sort them all, and focus your testing attention on the most complicated ones.

Ira Baxter
As far as you know, does any such tool exist?
epotter
Well, sort of yes. My company offers a Source Code Search Engine (SCSE) (http://www.semanticdesigns.com/Products/SearchEngine) which scans over a set of files to prepare an index for searching. The SCSE happens to compute a number of simple metrics (SLOC, CommentCount, Cyclomatic, Halstead) over each file as a whole during the scan, *and* it will process many languages, including PLSQL. PLSQL of course has SQL as a sublanguage, and IIRC, SCSE computes software complexity numbers pretty much as I have described above. If you put your SQL fragments into files, the SCSE would probably do it.
Ira Baxter
... There is always the question of *where are your SQL fragments?* If they are embedded in string fragments in ODBC calls, extracting and measuring them is going to be difficult, because the parts are scattered across the code and it isn't instantly obvious that any particular string literal is part of a query or, if so, where it goes. If your SQL queries are embedded in a stored procedure language such as PLSQL, they are obviously much easier to "extract". But the ideal tool in that case is one that measures the SQL queries separately, in situ, so you don't have to extract them by hand or by hack.
Ira Baxter
... in this latter case, what you need is a tool to compute the complexity of fragments of the stored procedure file. My company also offers Metrics tools for many languages (but not presently any stored procedure language) that compute metrics over program elements (e.g. functions/methods) and summary rollups, based on a framework for computing such metrics over abstract syntax trees produced by parsing the source code. That metrics machinery could be focused on PLSQL or TSQL to produce probably exactly what you want, but it is custom work.
Ira Baxter