tags:

views:

146

answers:

9

I've just received a new data source for my application which inserts data into a Derby database only when it changes. Normally, missing data is fine - I'm drawing a line chart with the data (value over time), and I'd just draw a line between the two points, extrapolating the expected value at any given point. The problem is that as missing data in this case means "draw a straight line," the graph would be incorrect if I did this.

There are two ways I could fix this: I could create a new class that handles missing data differently (which could be difficult due to the way prefuse, the drawing library I'm using, handles drawing), or I could duplicate the rows, leaving the y value the same while changing the x value in each row. I could do this in the Java that bridges the database and the renderer, or I could modify the SQL.

My question is, given a result set like the one below:

+-------+---------------------+
| value | received            |
+-------+---------------------+
|     7 | 2000-01-01 08:00:00 |
|    10 | 2000-01-01 08:00:05 |
|    11 | 2000-01-01 08:00:07 |
|     2 | 2000-01-01 08:00:13 |
|     4 | 2000-01-01 08:00:16 |
+-------+---------------------+

Assuming I query it at 8:00:20, how can I make it look like the following using SQL? Basically, I'm duplicating the row for every second until it's already taken. received is, for all intents and purposes, unique (it's not, but it will be due to the WHERE clause in the query).

+-------+---------------------+
| value | received            |
+-------+---------------------+
|     7 | 2000-01-01 08:00:00 |
|     7 | 2000-01-01 08:00:01 |
|     7 | 2000-01-01 08:00:02 |
|     7 | 2000-01-01 08:00:03 |
|     7 | 2000-01-01 08:00:04 |
|    10 | 2000-01-01 08:00:05 |
|    10 | 2000-01-01 08:00:06 |
|    11 | 2000-01-01 08:00:07 |
|    11 | 2000-01-01 08:00:08 |
|    11 | 2000-01-01 08:00:09 |
|    11 | 2000-01-01 08:00:10 |
|    11 | 2000-01-01 08:00:11 |
|    11 | 2000-01-01 08:00:12 |
|     2 | 2000-01-01 08:00:13 |
|     2 | 2000-01-01 08:00:14 |
|     2 | 2000-01-01 08:00:15 |
|     4 | 2000-01-01 08:00:16 |
|     4 | 2000-01-01 08:00:17 |
|     4 | 2000-01-01 08:00:18 |
|     4 | 2000-01-01 08:00:19 |
|     4 | 2000-01-01 08:00:20 |
+-------+---------------------+

Thanks for your help.

A: 

One way to handle this is to left join your data against a table that contains all of the received values. Then, when there is no value for that row, you calculate what the projected value should be based on the previous and next actual values you have.

You didn't say what database platform you are using. In SQL Server, I would create a User Defined Function that accepts a start datetime and end datetime value. It would return a table value with all of the received values you need.

I have simulated it below, which runs in SQL Server. The subselect aliased r is what would actually get returned by the user defined function.

select r.received,
isnull(d.value,(select top 1 data.value from data where data.received < r.received order by data.received desc)) as x
from (
    select cast('2000-01-01 08:00:00' as datetime) received
    union all
    select cast('2000-01-01 08:00:01' as datetime)
    union all
    select cast('2000-01-01 08:00:02' as datetime)
    union all
    select cast('2000-01-01 08:00:03' as datetime)
    union all
    select cast('2000-01-01 08:00:04' as datetime)
    union all
    select cast('2000-01-01 08:00:05' as datetime)
    union all
    select cast('2000-01-01 08:00:06' as datetime)
    union all
    select cast('2000-01-01 08:00:07' as datetime)
    union all
    select cast('2000-01-01 08:00:08' as datetime)
    union all
    select cast('2000-01-01 08:00:09' as datetime)
    union all
    select cast('2000-01-01 08:00:10' as datetime)
    union all
    select cast('2000-01-01 08:00:11' as datetime)
    union all
    select cast('2000-01-01 08:00:12' as datetime)
    union all
    select cast('2000-01-01 08:00:13' as datetime)
    union all
    select cast('2000-01-01 08:00:14' as datetime)
    union all
    select cast('2000-01-01 08:00:15' as datetime)
    union all
    select cast('2000-01-01 08:00:16' as datetime)
    union all
    select cast('2000-01-01 08:00:17' as datetime)
    union all
    select cast('2000-01-01 08:00:18' as datetime)
    union all
    select cast('2000-01-01 08:00:19' as datetime)
    union all
    select cast('2000-01-01 08:00:20' as datetime)
) r
left outer join Data d on r.received = d.received
RedFilter
+3  A: 

Due to the set based nature of SQL, there's no simple way to do this. I have used two solution strategies:

a) use a cycle to go from the initial to end date time and for each step get the value, and insert that into a temp table

b) generate a table (normal or temporary) with the 1 minute increments, adding the base date time to this table you can generate the steps.

Example of approach b) (SQL Server version)

Let's assume we will never query more than 24 hours of data. We create a table intervals that has a dttm field with the minute count for each step. That table must be populated previously.

select dateadd(minute,stepMinutes,'2000-01-01 08:00') received,
(select top 1 value from table where received <= 
dateadd(minute,dttm,'2000-01-01 08:00') 
order by received desc) value
from intervals
tekBlues
A: 

Better would be to have a table for each axial value you want to have on the graph, and then either join to it or even just put the data field there and update that record when/if values arrive.

The "missing values" problem is quite extensive, so I suggest you have a solid policy.

One thing that will happen is that you will have multiple adjacent slots with missing values.

This would be much easier if you could transform it into OLAP data.

le dorfier
A: 

Create a simple table that has all the minutes (warning, will run for a while):

Create Table Minutes(Value DateTime Not Null)
Go

Declare @D DateTime
Set @D = '1/1/2000'

While (Year(@D) < 2002)
Begin
  Insert Into Minutes(Value) Values(@D)
  Set @D = DateAdd(Minute, 1, @D)
End
Go


Create Clustered Index IX_Minutes On Minutes(Value)
Go

You can then use it somewhat like this:

Select 
  Received = Minutes.Value,
  Value = (Select Top 1 Data.Value
           From Data
           Where Data.Received <= Minutes.Received
           Order By Data.Received Desc)
From
  Minutes
Where
  Minutes.Value Between @Start And @End
Stu
A: 

I would recommend against solving this in SQL/the database due to the set based nature of it. Also you are dealing with seconds here so I guess you could end up with a lot of rows, with the same repeated data, that would have to be transfered from the database to you application.

Jonas Elfström
+2  A: 

It seems like in this case you really don't need to generate all of these datapoints. Would it be correct to generate the following instead? If it's drawing a straight line, you don't need go generate a data point for each second, just two for each datapoint...one at the current time, one right before the next time. This example subtracts 5 ms from the next time, but you could make it a full second if you need it.

+-------+---------------------+
| value | received            |
+-------+---------------------+
|     7 | 2000-01-01 08:00:00 |
|     7 | 2000-01-01 08:00:04 |
|    10 | 2000-01-01 08:00:05 |
|    10 | 2000-01-01 08:00:06 |
|    11 | 2000-01-01 08:00:07 |
|    11 | 2000-01-01 08:00:12 |
|     2 | 2000-01-01 08:00:13 |
|     2 | 2000-01-01 08:00:15 |
|     4 | 2000-01-01 08:00:16 |
|     4 | 2000-01-01 08:00:20D |
+-------+---------------------+

If that's the case, then you can do the following:

SELECT * FROM
(SELECT * from TimeTable as t1
UNION
SELECT t2.value, dateadd(ms, -5, t2.received)
from ( Select t3.value, (select top 1 t4.received  
                         from TimeTable t4 
                         where t4.received > t3.received
                         order by t4.received asc) as received
from TimeTable t3) as t2
UNION
SELECT top 1 t6.value, GETDATE()
from TimeTable t6
order by t6.received desc
) as t5
where received IS NOT NULL
order by t5.received

The big advantage of this is that it is a set based solution and will be much faster than any iterative approach.

LPCRoy
This is definitely a better idea, and the SQL looks solid at first glance. Is there a way to do it without using TOP 1 though? Derby's pretty bare - it doesn't support TOP or LIMIT.
Samir Talwar
No, you can't. You need the top part of the query in order to get only the next item (e.g. the time after the current time)If you can move off of Derby onto MySQL or PostgreSql, i know both of these support this example. This was written in MSSQL so there would need to be some adaptation
LPCRoy
Damnit. We're embedding Derby, and so it's pretty intrinsic to our software - we can use another database, but we'd be using Derby as well, and that seems a little too much for what's supposed to be a simple app. It seems I'll be doing this in the Java code instead. Nevertheless, I'm accepting your answer, as you've basically answered the question for anyone using a half-decent database.
Samir Talwar
+1  A: 

You could just walk a cursor, keep vars for the last value & time returned, and if the current one is more than a second ahead, loop one second at a time using the previous value and the new time until you get the the current row's time.

Trying to do this in SQL would be painful, and if you went and created the missing data, you would possible have to add a column to track real / interpolated data points.

Kevin K
A: 

If you were in SQL Server, then this would be a good start. I am not sure how close Apache's Derby is to sql.

Usage: EXEC ElaboratedData '2000-01-01 08:00:00','2000-01-01 08:00:20'

CREATE PROCEDURE [dbo].[ElaboratedData]
  @StartDate DATETIME,
  @EndDate DATETIME
AS
  --if not a valid interval, just quit
  IF @EndDate<=@StartDate BEGIN
    SELECT 0;    
    RETURN;
  END;

  /*
  Store the value of 1 second locally, for readability
  --*/
  DECLARE @OneSecond FLOAT;
  SET @OneSecond = (1.00000000/86400.00000000);

  /*
  create a temp table w/the same structure as the real table.
  --*/
  CREATE TABLE #SecondIntervals(TSTAMP DATETIME, DATAPT INT);

  /*
  For each second in the interval, check to see if we have a known value.
  If we do, then use that.  If not, make one up.
  --*/ 
  DECLARE @CurrentSecond DATETIME; 
  SET @CurrentSecond = @StartDate;
  WHILE @CurrentSecond <= @EndDate BEGIN
    DECLARE @KnownValue INT;

    SELECT @KnownValue=DATAPT
    FROM TESTME
    WHERE TSTAMP = @CurrentSecond;

    IF (0 = ISNULL(@KnownValue,0)) BEGIN
      --ok, we have to make up a fake value
      DECLARE @MadeUpValue INT;
      /*
      *******Put whatever logic you want to make up a fake value here
      --*/
      SET @MadeUpValue = 99;

      INSERT INTO #SecondIntervals(
        TSTAMP
       ,DATAPT
      )
      VALUES(
        @CurrentSecond
       ,@MadeUpValue
      );
    END;  --if we had to make up a value
    SET @CurrentSecond = @CurrentSecond + @OneSecond;
  END;  --while looking thru our values

  --finally, return our generated values + real values
  SELECT TSTAMP, DATAPT FROM #SecondIntervals
  UNION ALL
  SELECT TSTAMP, DATAPT FROM TESTME
  ORDER BY TSTAMP;
GO
JosephStyons
A: 

As just an idea, you might want to check out Anthony Mollinaro's SQL Cookbook, chapter 9. He has a recipe, "Filling in Missing Dates" (check out pages 278-281), that discusses primarily what you are trying to do. It requires some sort of sequential handling, either via a helper table or doing the query recursively. While he doesn't have examples for Derby directly, I suspect you could probably adapt them to your problem (particularly the PostgreSQL or MySQL one, it seems somewhat platform agnostic).

sheepsimulator