views:

286

answers:

2

I've got a table of temperature samples over time from several sources and I want to find the minimum, maximum, and average temperatures across all sources at set time intervals. At first glance this is easily done like so:

SELECT MIN(temp), MAX(temp), AVG(temp) FROM samples GROUP BY time;

However, things become much more complicated (to the point of where I'm stumped!) if sources drop in and out and rather than ignoring the missing sources during the intervals in question I want to use the sources' last know temperatures for the missing samples. Using datetimes and constructing intervals (say every minute) across samples unevenly distributed over time further complicates things.

I think it should be possible to create the results I want by doing a self-join on the samples table where the time from the first table is greater than or equal to the time of the second table and then calculating aggregate values for rows grouped by source. However, I'm stumped about how to actually do this.

Here's my test table:

+------+------+------+
| time   | source  | temp |
+------+------+------+
|    1 | a    |   20 | 
|    1 | b    |   18 | 
|    1 | c    |   23 | 
|    2 | b    |   21 | 
|    2 | c    |   20 | 
|    2 | a    |   18 | 
|    3 | a    |   16 | 
|    3 | c    |   13 | 
|    4 | c    |   15 | 
|    4 | a    |    4 | 
|    4 | b    |   31 | 
|    5 | b    |   10 | 
|    5 | c    |   16 | 
|    5 | a    |   22 | 
|    6 | a    |   18 | 
|    6 | b    |   17 | 
|    7 | a    |   20 | 
|    7 | b    |   19 | 
+------+------+------+
INSERT INTO samples (time, source, temp) VALUES (1, 'a', 20), (1, 'b', 18), (1, 'c', 23), (2, 'b', 21), (2, 'c', 20), (2, 'a', 18), (3, 'a', 16), (3, 'c', 13), (4, 'c', 15), (4, 'a', 4), (4, 'b', 31), (5, 'b', 10), (5, 'c', 16), (5, 'a', 22), (6, 'a', 18), (6, 'b', 17), (7, 'a', 20), (7, 'b', 19);

To do my min, max and avg calculations, I want an intermediate table that looks like this:

+------+------+------+
| time   | source  | temp |
+------+------+------+
|    1 | a    |   20 | 
|    1 | b    |   18 | 
|    1 | c    |   23 | 
|    2 | b    |   21 | 
|    2 | c    |   20 | 
|    2 | a    |   18 | 
|    3 | a    |   16 | 
|    3 | b    |   21 | 
|    3 | c    |   13 | 
|    4 | c    |   15 | 
|    4 | a    |    4 | 
|    4 | b    |   31 | 
|    5 | b    |   10 | 
|    5 | c    |   16 | 
|    5 | a    |   22 | 
|    6 | a    |   18 | 
|    6 | b    |   17 | 
|    6 | c    |   16 | 
|    7 | a    |   20 | 
|    7 | b    |   19 | 
|    7 | c    |   16 | 
+------+------+------+

The following query is getting me close to what I want but it takes the temperature value of the source's first result, rather than the most recent one at the given time interval:

SELECT s.dt as sdt, s.mac, ss.temp, MAX(ss.dt) as maxdt FROM (SELECT DISTINCT dt FROM samples) AS s CROSS JOIN samples AS ss WHERE s.dt >= ss.dt GROUP BY sdt, mac HAVING maxdt <= s.dt ORDER BY sdt ASC, maxdt ASC;

+------+------+------+-------+
| sdt  | mac  | temp | maxdt |
+------+------+------+-------+
|    1 | a    |   20 |     1 | 
|    1 | c    |   23 |     1 | 
|    1 | b    |   18 |     1 | 
|    2 | a    |   20 |     2 | 
|    2 | c    |   23 |     2 | 
|    2 | b    |   18 |     2 | 
|    3 | b    |   18 |     2 | 
|    3 | a    |   20 |     3 | 
|    3 | c    |   23 |     3 | 
|    4 | a    |   20 |     4 | 
|    4 | c    |   23 |     4 | 
|    4 | b    |   18 |     4 | 
|    5 | a    |   20 |     5 | 
|    5 | c    |   23 |     5 | 
|    5 | b    |   18 |     5 | 
|    6 | c    |   23 |     5 | 
|    6 | a    |   20 |     6 | 
|    6 | b    |   18 |     6 | 
|    7 | c    |   23 |     5 | 
|    7 | b    |   18 |     7 | 
|    7 | a    |   20 |     7 | 
+------+------+------+-------+

Update: chadhoc (great name, by the way!) gives a nice solution that unfortunately does not work in MySQL, since it does not support the FULL JOIN he uses. Luckily, I believe a simple UNION is an effective replacement:

-- Unify the original samples with the missing values that we've calculated
(
  SELECT time, source, temp
  FROM samples
)
UNION
( -- Pull all the time/source combinations that we are missing from the sample set, along with the temp
  -- from the last sampled interval for the same time/source combination if we do not have one
  SELECT  a.time, a.source, (SELECT t2.temp FROM samples AS t2 WHERE t2.time < a.time AND t2.source = a.source ORDER BY t2.time DESC LIMIT 1) AS temp
  FROM    
  ( -- All values we want to get should be a cross of time/temp
    SELECT t1.time, s1.source
    FROM
    (SELECT DISTINCT time FROM samples) AS t1
    CROSS JOIN
    (SELECT DISTINCT source FROM samples) AS s1
  ) AS a
  LEFT JOIN samples s
  ON a.time = s.time
  AND a.source = s.source
  WHERE s.source IS NULL
)
ORDER BY time, source;

Update 2: MySQL gives the following EXPLAIN output for chadhoc's code:

+----+--------------------+------------+------+---------------+------+---------+------+------+-----------------------------+
| id | select_type        | table      | type | possible_keys | key  | key_len | ref  | rows | Extra                       |
+----+--------------------+------------+------+---------------+------+---------+------+------+-----------------------------+
|  1 | PRIMARY            | temp       | ALL  | NULL          | NULL | NULL    | NULL |   18 |                             | 
|  2 | UNION              | <derived4> | ALL  | NULL          | NULL | NULL    | NULL |   21 |                             | 
|  2 | UNION              | s          | ALL  | NULL          | NULL | NULL    | NULL |   18 | Using where                 | 
|  4 | DERIVED            | <derived6> | ALL  | NULL          | NULL | NULL    | NULL |    3 |                             | 
|  4 | DERIVED            | <derived5> | ALL  | NULL          | NULL | NULL    | NULL |    7 |                             | 
|  6 | DERIVED            | temp       | ALL  | NULL          | NULL | NULL    | NULL |   18 | Using temporary             | 
|  5 | DERIVED            | temp       | ALL  | NULL          | NULL | NULL    | NULL |   18 | Using temporary             | 
|  3 | DEPENDENT SUBQUERY | t2         | ALL  | NULL          | NULL | NULL    | NULL |   18 | Using where; Using filesort | 
| NULL | UNION RESULT       | <union1,2> | ALL  | NULL          | NULL | NULL    | NULL | NULL | Using filesort              | 
+----+--------------------+------------+------+---------------+------+---------+------+------+-----------------------------+

I was able to get Charles' code working like so:

SELECT T.time, S.source,
  COALESCE(
    D.temp,
    (
      SELECT temp FROM samples
      WHERE source = S.source AND time = (
        SELECT MAX(time)
        FROM samples
        WHERE
          source = S.source
          AND time < T.time
      )
    )
  ) AS temp
FROM (SELECT DISTINCT time FROM samples) AS T
CROSS JOIN (SELECT DISTINCT source FROM samples) AS S
  LEFT JOIN samples AS D
ON D.source = S.source AND D.time = T.time

Its explanation is:

+----+--------------------+------------+------+---------------+------+---------+------+------+-----------------+
| id | select_type        | table      | type | possible_keys | key  | key_len | ref  | rows | Extra           |
+----+--------------------+------------+------+---------------+------+---------+------+------+-----------------+
|  1 | PRIMARY            | <derived5> | ALL  | NULL          | NULL | NULL    | NULL |    3 |                 | 
|  1 | PRIMARY            | <derived4> | ALL  | NULL          | NULL | NULL    | NULL |    7 |                 | 
|  1 | PRIMARY            | D          | ALL  | NULL          | NULL | NULL    | NULL |   18 |                 | 
|  5 | DERIVED            | temp       | ALL  | NULL          | NULL | NULL    | NULL |   18 | Using temporary | 
|  4 | DERIVED            | temp       | ALL  | NULL          | NULL | NULL    | NULL |   18 | Using temporary | 
|  2 | DEPENDENT SUBQUERY | temp       | ALL  | NULL          | NULL | NULL    | NULL |   18 | Using where     | 
|  3 | DEPENDENT SUBQUERY | temp       | ALL  | NULL          | NULL | NULL    | NULL |   18 | Using where     | 
+----+--------------------+------------+------+---------------+------+---------+------+------+-----------------+
+1  A: 

I think you'll get better performance making use of the ranking/windowing functions in mySql, but unfortunately I do not know those as well as the TSQL implementation. Here is an ANSI compliant solution that will work though:

-- Full join across the sample set and anything missing from the sample set, pulling the missing temp first if we do not have one
select coalesce(c1.[time], c2.[time]) as dt, coalesce(c1.source, c2.source) as source, coalesce(c2.temp, c1.temp) as temp
from samples c1
full join ( -- Pull all the time/source combinations that we are missing from the sample set, along with the temp
   -- from the last sampled interval for the same time/source combination if we do not have one
   select a.time, a.source,
     (select top 1 t2.temp from samples t2 where t2.time < a.time and t2.source = a.source order by t2.time desc) as temp
   from 
    ( -- All values we want to get should be a cross of time/samples
     select t1.[time], s1.source
     from
     (select distinct [time] from samples) as t1
     cross join
     (select distinct source from samples) as s1
    ) a
   left join samples s
   on a.[time] = s.time
   and a.source = s.source
   where s.source is null
  ) c2
on c1.time = c2.time
and c1.source = c2.source
order by dt, source
chadhoc
A: 

I know this looks complicated, but it's formatted to explain itself... It should work... Hope you only have three sources... If you have an arbitrary number of sources than this won't work... In that case see the second query... EDIT: Removed first attempt

EDIT: If you don't know the sources ahead of time, you'll have to do something where you create an intermediate result set that "Fills in" the missing values.. something like this:

2nd EDIT: Removed need for Coalesce by moving logic to retrieve most recent temp reading for each source from Select clause into the Join condition.

Select T.Time, Max(Temp) MaxTemp,
  Min(Temp) MinTemp, Avg(Temp) AvgTemp
From
  (Select T.TIme, S.Source, D.Temp
   From (Select Distinct Time From Samples) T
     Cross Join 
        (Select Distinct Source From Samples) S
     Left Join Samples D
        On D.Source = S.Source
           And D.Time = 
               (Select Max(Time)
                From Samples
                Where Source = S.Source
                   And Time <= T.Time)) Z
Group By T.Time
Charles Bretana
Thanks, Charles, but your solution assumes the sources are all known ahead of time. Do you have any suggestions when they are not known?
pr1001
Added another sql query for if you don't know the sources...
Charles Bretana
After changing IsNull to COALESCE I was able to get the query to work on my MySQL database. Thanks.
pr1001
@pr1001, ahh, MySQl does not have IsNull? I edited sql to make that change...
Charles Bretana