views:

484

answers:

1

Hi I am faced with a difficult problem:

I have a table (oracle 9i) of weather forecasts (many 100's of millions of records in size.) whose makeup looks like this:

stationid    forecastdate    forecastinterval    forecastcreated    forecastvalue
---------------------------------------------------------------------------------
varchar (pk) datetime (pk)   integer (pk)        datetime (pk)      integer

where:

  • stationid refers to one of the many weather stations that may create a forecast;
  • forecastdate refers to the date the forecast is for (date only not time.)
  • forecastinterval refers to the hour in the forecastdate for the forecast (0 - 23).
  • forecastcreated refers to the time the forecast was made, can be many days beforehand.
  • forecastvalue refers to the actual value of the forecast (as the name implies.)

I need to determine for a given stationid and a given forecastdate and forecastinterval pair, the records where a forecastvalue increments more than a nominal number (say 500). I'll show a table of the condition here:

stationid    forecastdate    forecastinterval    forecastcreated    forecastvalue
---------------------------------------------------------------------------------
'stationa'   13-dec-09       10                  10-dec-09 04:50:10  0
'stationa'   13-dec-09       10                  10-dec-09 17:06:13  0
'stationa'   13-dec-09       10                  12-dec-09 05:20:50  300
'stationa'   13-dec-09       10                  13-dec-09 09:20:50  300

In the above scenario, I'd like to pull out the third record. This is the record where the forecast value increased by a nominal (say 100) amount.

The task is proving to be very difficult due to the sheer size of the table (many 100s of millions of records.) and taking so long to finish (so long in fact that my query has never returned.)

Here is my attempt so far to grab these values:

select
    wtr.stationid,
    wtr.forecastcreated,
    wtr.forecastvalue,
    (wtr.forecastdate + wtr.forecastinterval / 24) fcst_date
from
    (select inner.*
            rank() over (partition by stationid, 
                                   (inner.forecastdate + inner.forecastinterval),
                                   inner.forecastcreated
                         order by stationid, 
                                  (inner.forecastdate + inner.forecastinterval) asc,
                                  inner.forecastcreated asc
            ) rk
      from weathertable inner) wtr 
      where
      wtr.forecastvalue - 100 > (
                     select lastvalue
                      from (select y.*,
                            rank() over (partition by stationid, 
                                            (forecastdate + forecastinterval),
                                            forecastcreated
                                         order by stationid, 
                                           (forecastdate + forecastinterval) asc,
                                           forecastcreated asc) rk
                             from weathertable y
                            ) z
                       where z.stationid = wtr.stationid
                             and z.forecastdate = wtr.forecastdate                                                   
                             and (z.forecastinterval =    
                                         wtr.forecastinterval)
/* here is where i try to get the 'previous' forecast value.*/
                             and wtr.rk = z.rk + 1)
+1  A: 

Rexem's suggestion of using LAG() is the right approach but we need to use a partitioning clause. This becomes clear once we add rows for different intervals and different stations...

SQL> select * from t
  2  /    
STATIONID  FORECASTDATE INTERVAL FORECASTCREATED     FORECASTVALUE
---------- ------------ -------- ------------------- -------------
stationa   13-12-2009         10 10-12-2009 04:50:10             0
stationa   13-12-2009         10 10-12-2009 17:06:13             0
stationa   13-12-2009         10 12-12-2009 05:20:50           300
stationa   13-12-2009         10 13-12-2009 09:20:50           300
stationa   13-12-2009         11 13-12-2009 09:20:50           400
stationb   13-12-2009         11 13-12-2009 09:20:50           500

6 rows selected.

SQL> SELECT v.stationid,
  2         v.forecastcreated,
  3         v.forecastvalue,
  4         (v.forecastdate + v.forecastinterval / 24) fcst_date
  5    FROM (SELECT t.stationid,
  6                 t.forecastdate,
  7                 t.forecastinterval,
  8                 t.forecastcreated,
  9                 t.forecastvalue,
 10                 t.forecastvalue - LAG(t.forecastvalue, 1)
 11                      OVER (ORDER BY t.forecastcreated) as difference
 12            FROM t) v
 13   WHERE v.difference >= 100
 14  /    
STATIONID  FORECASTCREATED     FORECASTVALUE FCST_DATE
---------- ------------------- ------------- -------------------
stationa   12-12-2009 05:20:50           300 13-12-2009 10:00:00
stationa   13-12-2009 09:20:50           400 13-12-2009 11:00:00
stationb   13-12-2009 09:20:50           500 13-12-2009 11:00:00

SQL>

To remove the false positives we group the LAG() by STATIONID, FORECASTDATE and FORECASTINTERVAL. Note that the following relies on the inner query returning NULL from the first calculation of each partition window.

SQL> SELECT v.stationid,
  2         v.forecastcreated,
  3         v.forecastvalue,
  4         (v.forecastdate + v.forecastinterval / 24) fcst_date
  5    FROM (SELECT t.stationid,
  6                 t.forecastdate,
  7                 t.forecastinterval,
  8                 t.forecastcreated,
  9                 t.forecastvalue,
 10                 t.forecastvalue - LAG(t.forecastvalue, 1)
 11                      OVER (PARTITION BY t.stationid
 12                                         , t.forecastdate
 13                                         , t.forecastinterval
 14                            ORDER BY t.forecastcreated) as difference
 15            FROM t) v
 16   WHERE v.difference >= 100
 17  /

STATIONID  FORECASTCREATED     FORECASTVALUE FCST_DATE
---------- ------------------- ------------- -------------------
stationa   12-12-2009 05:20:50           300 13-12-2009 10:00:00

SQL>

Working with large volumes of data

You describe your tables as containing many hundreds of millions of rows. Such huge tables are like black holes, they have a different physics. There are various potential approaches, depending on your needs, timescales, finances, database version and edition, and any other usage of your system's data. It's more than a five minute answer.

But here's the five minute answer anyway.

Assuming your table is the live table it is presumably being populating by adding forecasts as they occur, which is basically an appending operation. This would mean forecasts for any given station are scattered throughout the table. Consequently indexes on just STATIONID or even FORECASTDATE would have a poor clustering factor.

On that assumption, the one thing I would suggest you try first is building an index on (STATIONID, FORCASTDATE, FORECASTINTERVAL, FORECASTCREATED, FORECASTVALUE). This will take some time (and disk space) to build, but it ought to speed up your subsequent queries quite considerably, because it has all the columns needed to satisfy the query with an INDEX RANGE SCAN without touching the table at all.

APC
Thanks for clearing up on the `partition by`. I've run these solutions a few times now. The tables are so huge that to get an immediate response I've placed a restriction of the last 10 days in my where clause to force a small index search e.g (select * from t where t.forecastdate > sysdate - 10) tNot an ideal solution, I guess the alternative is to force a full table scan using /*+ full(t) */ (considering that it seems indexing is taking too long even for 10 days using the previous approach.)?
Simon Edwards
@APC: thx, seldom get to use LEAD/LAG.
OMG Ponies