ansaurus

Question

Oracle: Find previous record for a ranked list of forecasts

Answer 1

+1 A:

Rexem's suggestion of using LAG() is the right approach but we need to use a partitioning clause. This becomes clear once we add rows for different intervals and different stations...

SQL> select * from t
  2  /    
STATIONID  FORECASTDATE INTERVAL FORECASTCREATED     FORECASTVALUE
---------- ------------ -------- ------------------- -------------
stationa   13-12-2009         10 10-12-2009 04:50:10             0
stationa   13-12-2009         10 10-12-2009 17:06:13             0
stationa   13-12-2009         10 12-12-2009 05:20:50           300
stationa   13-12-2009         10 13-12-2009 09:20:50           300
stationa   13-12-2009         11 13-12-2009 09:20:50           400
stationb   13-12-2009         11 13-12-2009 09:20:50           500

6 rows selected.

SQL> SELECT v.stationid,
  2         v.forecastcreated,
  3         v.forecastvalue,
  4         (v.forecastdate + v.forecastinterval / 24) fcst_date
  5    FROM (SELECT t.stationid,
  6                 t.forecastdate,
  7                 t.forecastinterval,
  8                 t.forecastcreated,
  9                 t.forecastvalue,
 10                 t.forecastvalue - LAG(t.forecastvalue, 1)
 11                      OVER (ORDER BY t.forecastcreated) as difference
 12            FROM t) v
 13   WHERE v.difference >= 100
 14  /    
STATIONID  FORECASTCREATED     FORECASTVALUE FCST_DATE
---------- ------------------- ------------- -------------------
stationa   12-12-2009 05:20:50           300 13-12-2009 10:00:00
stationa   13-12-2009 09:20:50           400 13-12-2009 11:00:00
stationb   13-12-2009 09:20:50           500 13-12-2009 11:00:00

SQL>

To remove the false positives we group the LAG() by STATIONID, FORECASTDATE and FORECASTINTERVAL. Note that the following relies on the inner query returning NULL from the first calculation of each partition window.

SQL> SELECT v.stationid,
  2         v.forecastcreated,
  3         v.forecastvalue,
  4         (v.forecastdate + v.forecastinterval / 24) fcst_date
  5    FROM (SELECT t.stationid,
  6                 t.forecastdate,
  7                 t.forecastinterval,
  8                 t.forecastcreated,
  9                 t.forecastvalue,
 10                 t.forecastvalue - LAG(t.forecastvalue, 1)
 11                      OVER (PARTITION BY t.stationid
 12                                         , t.forecastdate
 13                                         , t.forecastinterval
 14                            ORDER BY t.forecastcreated) as difference
 15            FROM t) v
 16   WHERE v.difference >= 100
 17  /

STATIONID  FORECASTCREATED     FORECASTVALUE FCST_DATE
---------- ------------------- ------------- -------------------
stationa   12-12-2009 05:20:50           300 13-12-2009 10:00:00

SQL>

Working with large volumes of data

You describe your tables as containing many hundreds of millions of rows. Such huge tables are like black holes, they have a different physics. There are various potential approaches, depending on your needs, timescales, finances, database version and edition, and any other usage of your system's data. It's more than a five minute answer.

But here's the five minute answer anyway.

Assuming your table is the live table it is presumably being populating by adding forecasts as they occur, which is basically an appending operation. This would mean forecasts for any given station are scattered throughout the table. Consequently indexes on just STATIONID or even FORECASTDATE would have a poor clustering factor.

On that assumption, the one thing I would suggest you try first is building an index on (STATIONID, FORCASTDATE, FORECASTINTERVAL, FORECASTCREATED, FORECASTVALUE). This will take some time (and disk space) to build, but it ought to speed up your subsequent queries quite considerably, because it has all the columns needed to satisfy the query with an INDEX RANGE SCAN without touching the table at all.

APC 2009-10-05 05:42:12

Thanks for clearing up on the `partition by`. I've run these solutions a few times now. The tables are so huge that to get an immediate response I've placed a restriction of the last 10 days in my where clause to force a small index search e.g (select * from t where t.forecastdate > sysdate - 10) tNot an ideal solution, I guess the alternative is to force a full table scan using /*+ full(t) */ (considering that it seems indexing is taking too long even for 10 days using the previous approach.)?

Simon Edwards 2009-10-05 05:51:02

@APC: thx, seldom get to use LEAD/LAG.

OMG Ponies 2009-10-05 14:44:18

ansaurus

tags:

views:

answers:

Oracle: Find previous record for a ranked list of forecasts

related questions