There must be a better way of writing this query.

I want to select all the data between a pair of dates. Ideally the first and last rows of the result set would be the ones specified in the WHERE clause. If those exact rows don't exist, I want the rows immediately preceding and following the requested range.

An example:

If my data is:

...
135321, 20090311 10:15:00
135321, 20090311 10:45:00
135321, 20090311 11:00:00
135321, 20090311 11:15:00
135321, 20090311 11:30:00
135321, 20090311 12:30:00
...

And the query is:

    SELECT * 
    FROM data_bahf 
    WHERE param_id = 135321 
    AND datetime >= '20090311 10:30:00' 
    AND datetime <= '20090311 12:00:00'

I want the returned data to include the row at 10:15 and the row at 12:30, not just the rows that strictly satisfy the WHERE clause.

This is the best I've come up with.

SELECT * FROM (
    SELECT * 
    FROM data_bahf 
    WHERE param_id = 135321 
    AND datetime > '20090311 10:30:00' 
    AND datetime < '20090311 12:00:00'

    UNION

    (
        SELECT * FROM data_bahf 
        WHERE param_id = 135321 
        AND datetime <= '20090311 10:30:00' 
        ORDER BY datetime desc
        LIMIT 1
    )

    UNION

    (
        SELECT * FROM data_bahf 
        WHERE param_id = 135321 
        AND datetime >= '20090311 12:00:00'
        ORDER BY datetime asc
        LIMIT 1
    )
) 
AS A
ORDER BY datetime

(Ignore the use of SELECT * for now)

EDIT: I have indexes on param_id, datetime, and (param_id, datetime)

+3  A: 

I'd say this:

SELECT 
  o.* 
FROM 
  data_bahf o
WHERE 
  o.param_id = 135321 
  AND o.datetime BETWEEN
  ISNULL(
    (
      SELECT   MAX(datetime) 
      FROM     data_bahf i
      WHERE    i.param_id = 135321 AND i.datetime <= '20090311 10:30:00'
    ),
    '0001-01-01 00:00:00'
  )
  AND
  ISNULL(
    (
      SELECT   MIN(datetime) 
      FROM     data_bahf i
      WHERE    i.param_id = 135321 AND i.datetime >= '20090311 12:00:00'
    ),
    '9999-12-31 23:59:59'
  )

EDIT: Fallback added.
When there is no row matching the sub-query, it results in a NULL value, which must be caught by ISNULL(); otherwise the BETWEEN predicate evaluates to NULL and the main query returns no rows at all.
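
(A quick one-off illustration of that failure mode, sketched with COALESCE since it works in both PostgreSQL and SQL Server:)

-- NULL as a BETWEEN bound makes the whole predicate NULL rather than TRUE,
-- so every row is filtered out; a fallback value restores a real bound.
SELECT 1 WHERE 5 BETWEEN NULL AND 10;                -- returns no rows
SELECT 1 WHERE 5 BETWEEN COALESCE(NULL, 0) AND 10;   -- returns one row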

Tomalak
It does look better - but it's 10 times slower!
Johan
Do you have an index on your datetime field? (I take it for granted that you have one on your param_id field.)
Tomalak
I do, and the query is using it. The times are 10ms for the original query, and 200ms for the new one. In absolute terms this isn't much, but I do have many of these queries to run. There are only about 200k rows.
Johan
I'm afraid I'm no PostgreSQL expert. For 200k rows the performance *should* be very snappy (I just tested it on SQL Server on a similar amount of live data and it's blazingly fast). Can you post the query execution plans (EXPLAIN) of both variants?
Tomalak
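
(For reference: the plans can be captured by prefixing each variant with EXPLAIN ANALYZE. A sketch using the plain range query as an example:)

EXPLAIN ANALYZE
SELECT *
FROM   data_bahf
WHERE  param_id = 135321
       AND datetime >= '20090311 10:30:00'
       AND datetime <= '20090311 12:00:00';
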
Postgres cannot cache dependent inner queries; it will re-evaluate them on each join. That's why your query is slow. To make it fast, you'll need to either pass the constant into each of the inner queries, or rewrite the correlated subqueries as uncorrelated ones.
Quassnoi
And I thought it would be a clever move to eliminate the constant in the inner query to make sure you only need to change it in one place... I'll change it to constants in the inner queries.
Tomalak
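
(For readers following along: a sketch reconstructing the correlated variant being discussed. It is not the exact original code, and COALESCE is used because ISNULL() is not available in PostgreSQL.)

SELECT  o.*
FROM    data_bahf o
WHERE   o.param_id = 135321
        AND o.datetime BETWEEN
        COALESCE(
        (
        SELECT  MAX(i.datetime)
        FROM    data_bahf i
        WHERE   i.param_id = o.param_id   -- correlated: re-evaluated for every outer row
                AND i.datetime <= '20090311 10:30:00'
        ), '0001-01-01 00:00:00')
        AND
        COALESCE(
        (
        SELECT  MIN(i.datetime)
        FROM    data_bahf i
        WHERE   i.param_id = o.param_id   -- correlated: re-evaluated for every outer row
                AND i.datetime >= '20090311 12:00:00'
        ), '9999-12-31 23:59:59')

Passing the constant (param_id = 135321) into the inner queries instead lets each sub-query be evaluated once, which is the change that brought the runtime down.
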
After making the changes, it now runs at <8ms. Looks better and runs better! Thanks.
Johan
Actually, it is Quassnoi's expertise that made this possible. I had the right idea, but he knew the missing bits. I think it's fair to accept his answer instead of mine.
Tomalak
I don't really mind which I accept, so... done.
Johan
+2  A: 

First, make sure that you have a composite index on (param_id, datetime)
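
For example (the index name is arbitrary; the asker notes above that this index already exists):

-- Composite index covering both the equality filter and the datetime range/aggregate
CREATE INDEX ix_data_bahf_param_datetime ON data_bahf (param_id, datetime);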

Second, query like this:

SELECT  *
FROM    data_bahf
WHERE   param_id = 135321
        AND datetime BETWEEN
        COALESCE(
        (
        SELECT  MAX(datetime)
        FROM    data_bahf
        WHERE   param_id = 135321
              AND datetime <= '2009-03-11 10:30:00'
        ), '0001-01-01')
        AND 
        COALESCE(
        (
        SELECT  MIN(datetime)
        FROM    data_bahf
        WHERE   param_id = 135321
              AND datetime >= '2009-03-11 12:00:00'
        ), '9999-01-01')

Just checked, it runs in 1.215 ms for a sample table of 200,000 rows

Quassnoi
Can you say why having a composite index is beneficial? (no critique, I'm just wondering)
Tomalak
Because if you don't have one, you'll need to filter out all non-matching param_id's while searching for MAX and MIN.
Quassnoi
When you already have an index on "param_id" (which I took as read for this question), I would expect the optimizer to apply it before doing MAX() or MIN() on the remaining rows. So I would think that two separate indexes would yield the same performance. (Or not?)
Tomalak
No, they won't. The optimizer can use either the index on param_id, in which case it will need to scan ALL rows with this param_id to find the MIN datetime, or the index on datetime, in which case it will need to scan all rows in descending order until it finds the first matching param_id.
Quassnoi
When you have a composite index, you only need to find one single leaf entry to the left of a given (param_id, datetime) pair.
Quassnoi
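
(A sketch of the lookup being described: with the composite index in place, the MIN/MAX sub-queries, or their equivalent ORDER BY ... LIMIT 1 form, can be answered from one end of the matching index range instead of a scan of every row for that param_id.)

-- Both of these typically cost a single index probe with an index on (param_id, datetime):
SELECT MAX(datetime)
FROM   data_bahf
WHERE  param_id = 135321
       AND datetime <= '20090311 10:30:00';

SELECT datetime
FROM   data_bahf
WHERE  param_id = 135321
       AND datetime <= '20090311 10:30:00'
ORDER BY datetime DESC
LIMIT 1;
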
Thanks for sharing the insight! :-) +1
Tomalak
Thanks for the help. I already had the composite index, but wasn't sure whether I was over-indexing, as it never seemed to be used in the query plan.
Johan