views:

308

answers:

8

Hi ,

What's better way to format following sql statement considering both readability and performance. Thanks.

sql = (char *)" SELECT * ,rowid FROM tblEvent_basic "
                  " WHERE "
                  " service_id = ? AND "
                  " ("
                  " (start_time >= ? AND start_time <= ?) OR "
                  " (end_time > ? AND end_time <?) OR "
                  " (start_time < ? AND end_time > ?)"
                  " )"
                  " ORDER by start_time ASC";

EDIT: 1.sqlite3 Database engine ; using C API ;running on MIPS24K 250M Embedded CPU.

2.the 2nd , 4th , 6th parameter is same , and 3rd, 5th, 7th is same.

  rc = sqlite3_bind_int(sql_stmt,1,service_id);
  rc = sqlite3_bind_text(sql_stmt,2,ts.start, 5, SQLITE_TRANSIENT);
  rc = sqlite3_bind_text(sql_stmt,3,ts.end  , 5, SQLITE_TRANSIENT);
  rc = sqlite3_bind_text(sql_stmt,4,ts.start, 5, SQLITE_TRANSIENT);
  rc = sqlite3_bind_text(sql_stmt,5,ts.end  , 5, SQLITE_TRANSIENT);
  rc = sqlite3_bind_text(sql_stmt,6,ts.start, 5, SQLITE_TRANSIENT);
  rc = sqlite3_bind_text(sql_stmt,7,ts.end  , 5, SQLITE_TRANSIENT);
+3  A: 

For starters, you could use BETWEEN instead of >= and <=. This would make the query much more readable, without any impact on performance. As far as optimizing the query for performance, you should look into using your database's equivalent of the EXPLAIN plan to give you some pointers on where your query is taking most of its time.

B.R.
+1 on EXPLAIN. Only the `start_time` range is inclusive however, so it would not improve the readability all that much.
Thorarin
+2  A: 

StartTime and EndTime should both be indexed - since all of the filtering and ordering is done based on those values, that's going to be important.

I'd also second using the BETWEEN statement if your SQL engine supports it. However, BETWEEN is usually inclusive (it is in SQL Server, anyway), so it may only work for your first date filter, since the others use < and >.

rwmnau
A: 

I recommend not using "SELECT *", that's typically CPU/time/whatever more consuming than explicitely listing the fields you want to have, and it is more readable for someone else because you don't have to remember which are the fields contained in the table.

+1 for the BETWEEN it will impact on performance, making your query faster.

Clement Herreman
how does BETWEEN make the query faster?
Thilo
A: 

Hmm... First off, no hardcoded queries in sourcecode. However, if you really do need this, check if your programming language of choice supports multiline strings or blocks (or whatever you might call that). For example, in Ruby:

sql = <<BLOCK

SELECT * ,rowid FROM tblEvent_basic 
WHERE 
service_id = ? AND 
(
(start_time >= ? AND start_time <= ?) OR 
(end_time >= ? AND end_time <?) OR 
(start_time < ? AND end_time > ?)
)
ORDER by start_time ASC;

BLOCK

or in C#:

sql = @"SELECT * ,rowid FROM tblEvent_basic 
WHERE 
service_id = ? AND 
(
(start_time >= ? AND start_time <= ?) OR 
(end_time >= ? AND end_time <?) OR 
(start_time < ? AND end_time > ?)
)
ORDER by start_time ASC;"
Anton Gogolev
I read that some query optimizers will actually be more efficient with literal queries instead of parameterized ones. I read this about Oracle once. Haven't verified it, but it was food for thought. Might have to do with caching mechanisms?
maxwellb
The change the was suggested here (moving the SQL out of the source code) has nothing to do with making it more or less literal. In both cases it is completely parameterized with bind variables. As to literal queries being more efficient sometimes, this can be true if there is significant skew in the data, and using a literal to check histogram information can lead to a more efficient query plan than the generalized version for a bind variable.
Thilo
Sometimes what happens is the cached query plan for a parameterized query uses the first set of parameters you use, which may end up with a pathologically bad query plan due to those particular parameters and then continues to use that query plan for all subsequent parameters.
thelsdj
@thelsdj: Yes, (in Oracle) that is a side-effect of the parameter-peek "feature".
Thilo
A: 

Formating (linebreaks, indenting, ...) will have no impact on performance. Except if you put tons (I mean like thousands/millions of unnecessary spaces) of whitespace that could significantly delay query transmision. Compiler will compile whole query as a single constant anyway.

maciejkow
readability. meh. i guess that was more of a personal preference
maxwellb
Of course it is, and this is why I emphasize that formating a query to be more readable will have no influence on performance.
maciejkow
A: 

Do you have any constraints you are imposing on the parameters? You could also optimize the query by removing unnecessary parameters if you have constraints which make the specifications listed in double.

Such as, your query is logically equivalent to:

"SELECT *,rowid FROM tblEvent_basic WHERE service_id = ? AND ( \
     end_time != ? AND \
     end_time > ? ) \
ORDER BY start_time ASC;"

with

rc = sqlite3_bind_int( sql_stmt,1,service_id);
rc = sqlite3_bind_text(sql_stmt,2,ts.end, 5, SQLITE_TRANSIENT);
rc = sqlite3_bind_text(sql_stmt,3,ts.start, 5, SQLITE_TRANSIENT);

.. with the assumption that ts.start <= ts.end. Application logic can often save the work of the database engine, if you have well-defined parameters.

maxwellb
Conversely, you could also optimize the query by introducing redundant constraints. It all depends on what indexes are present. In case of a "select *" he needs all column data anyway, so that the checks will not be expensive.
Thilo
A: 

Spcify which columns you need, select * should not be used in production code. By sending only the columns you need, performance will improve. Right now, rowid is returned twice and thus at least some of what you are returning is wasteful of database and network resources.

HLGEM
+4  A: 

Your temporal condition is currently:

            " (start_time >= ? AND start_time <= ?) OR "
            " (end_time > ? AND end_time <?) OR "
            " (start_time < ? AND end_time > ?)"

We can immediately improve the readability (in constant width fonts) with some spaces:

            " (start_time >= ? AND start_time <= ?) OR "
            " (end_time    > ? AND end_time    < ?) OR "
            " (start_time  < ? AND end_time    > ?)"

And from the commentary, we know that the same value will be passed to the placeholders 1, 3, 5, and a different value will be passed to placeholders 2, 4, 6 (but they all get the same value too). Further, if we call those times t1 and t2, then we can assume that t1 <= t2.

So, what is this criterion looking for?

  • start time falls in the range t1..t2
  • end time falls in the range t1..t2
  • start time is earlier than t1 and end time is later than t2

This is an overlaps criterion written the hard way - it should be replaced by:

            "(start_time <= ? AND end_time >= ?)"

Except that placeholder one here corresponds to t2 and placeholder two corresponds to t1. If you don't want events that meet the time range to be counted (that is, you do not want to count an event that ended at the instant t1, or an event that started at the instant t2), then change the '>=' and '<=' into '>' and '<' respectively.

This is the standard way of writing the overlaps predicate when including the end times. The condition is much simpler - no OR terms - and is reliable. It will be faster in that the optimizer has less work to do, and possibly that the execution engine will have fewer criteria to apply. (A really good optimizer might spot the equivalence of the 2-placeholder and 6-placeholder versions, but I wouldn't want to bet on it doing so - not least because the optimizer cannot tell that placeholders 1,3,5 will be the same, nor that placeholders 2,4,6 will be the same; that can only be determined if it bothered to reoptimize when the statement is executed.)

Jonathan Leffler
+1 For answering the unasked question. I like that you optimized out the relevance of some of the OR's. Otherwise one need not include start_time at all.
maxwellb
Oh, and assuming that end time can be inclusive on the end_time. I suspect this is what OP was going for, but was starting from the wrong angle.
maxwellb