views:

649

answers:

5

How can you join between a table with a sparse number of dates and another table with an exhaustive number of dates such that the gaps between the sparse dates take the values of the previous sparse date?

Illustrative example:

PRICE table (sparse dates):
date        itemid  price
2008-12-04  1       $1
2008-12-11  1       $3
2008-12-15  1       $7


VOLUME table (exhaustive dates):
date         itemid  volume_amt
2008-12-04   1       12345
2008-12-05   1       23456
2008-12-08   1       34567
2008-12-09   1       ...
2008-12-10   1
2008-12-11   1
2008-12-12   1
2008-12-15   1
2008-12-16   1
2008-12-17   1
2008-12-18   1

Desired result:

date       price  volume_amt
2008-12-04 $1     12345
2008-12-05 $1     23456
2008-12-08 $1     34567
2008-12-09 $1     ...
2008-12-10 $1
2008-12-11 $3
2008-12-12 $3
2008-12-15 $7
2008-12-16 $7
2008-12-17 $7
2008-12-18 $7

Update:

A couple people have suggested a correlated subquery that accomplishes the desired result. (Correlated subquery = a subquery that contains a reference to the outer query.)

This will work; however, I should have noted that the platform I'm using is MySQL, for which correlated subqueries are poorly optimized. Any way to do it without using a correlated subquery?

+2  A: 
SELECT v.date, p.price, v.volume
FROM volume v
LEFT JOIN Price p ON p.itemID=v.itemID
    AND p.[date] = (
                    SELECT MAX([date] )
                    FROM price p2 
                    WHERE p2.[date] <= v.[date] AND p2.itemid= v.itemid
                    GROUP BY p2.[date]
                   )
Joel Coehoorn
won't work -- will return volume only on the days when there is a new price. desired output is the volume should be returned for the price on that day or the nearest previous day for which there is a new price.
ʞɔıu
A: 
 SELECT  Volume.date, volume.itemid, price.price, volume.volume_amt
 FROM Volume 
 LEFT OUTER JOIN Price
 ON Volume.date = Price.date

Probably. My SQL-fu is weak

Paul
Nope, that would give him blanks for price in the rows that don't have exact matches.
Dave Costa
Ah, gotcha. We're doing share trades or something and want last-traded-price. <thinks>
Paul
+3  A: 

Assuming there is only 1 price per date/itemid:

select v.date, v.itemid, p.price
from volume v
join price p on p.itemid = v.item_id
where p.date = (select max(p2.date) from price p2
                where p2.itemid = v.itemid
                and p2.date <= v.date);
Tony Andrews
Same answer as mine, but I'll admit that you got it right faster even if mine shows that I posted sooner.
Joel Coehoorn
actually I believe this contains a mistake: the p.date condition belongs in the join condition, not in the where clause.
ʞɔıu
Does it make any difference?
Tony Andrews
Not in this case, but it would if you were doing an outer join.
Bill Karwin
my comment about the p.date comparison belonging in the where or on clauses was not correct; it doesn't make a difference.
ʞɔıu
A: 

This method works in Oracle. Don't know about other databases, and you didn't specify. If this exact syntax doesn't work in your database, I would guess there are similar techniques.

dev> select * from price;

AS_OF               ID     AMOUNT
----------- ---------- ----------
04-Dec-2008          1          1
11-Dec-2008          1          2
15-Dec-2008          1          3

dev> select * from volume;

DAY                 ID     VOLUME
----------- ---------- ----------
05-Dec-2008          1          1
06-Dec-2008          1          2
07-Dec-2008          1          3
08-Dec-2008          1          4
09-Dec-2008          1          5
10-Dec-2008          1          6
11-Dec-2008          1          7
12-Dec-2008          1          8
13-Dec-2008          1          9
14-Dec-2008          1         10
15-Dec-2008          1         11
16-Dec-2008          1         12
17-Dec-2008          1         13
18-Dec-2008          1         14
19-Dec-2008          1         15
20-Dec-2008          1         16
21-Dec-2008          1         17
22-Dec-2008          1         18
23-Dec-2008          1         19

dev> select day, volume, amount from (
  2    select day, volume, (select max(as_of) from price p where p.id = v.id and as_of <= day) price_as_of
  3      from volume v
  4  )
  5  join price on as_of = price_as_of
  6  order by day;

DAY             VOLUME     AMOUNT
----------- ---------- ----------
05-Dec-2008          1          1
06-Dec-2008          2          1
07-Dec-2008          3          1
08-Dec-2008          4          1
09-Dec-2008          5          1
10-Dec-2008          6          1
11-Dec-2008          7          2
12-Dec-2008          8          2
13-Dec-2008          9          2
14-Dec-2008         10          2
15-Dec-2008         11          3
16-Dec-2008         12          3
17-Dec-2008         13          3
18-Dec-2008         14          3
19-Dec-2008         15          3
20-Dec-2008         16          3
21-Dec-2008         17          3
22-Dec-2008         18          3
23-Dec-2008         19          3
Dave Costa
+7  A: 

This isn't as simple as a single LEFT OUTER JOIN to the sparse table, because you want the NULLs left by the outer join to be filled with the most recent price.

EXPLAIN SELECT v.`date`, v.volume_amt, p1.item_id, p1.price
FROM Volume v JOIN Price p1
  ON (v.`date` >= p1.`date` AND v.item_id = p1.item_id)
LEFT OUTER JOIN Price p2
  ON (v.`date` >= p2.`date` AND v.item_id = p2.item_id
    AND p1.`date` < p2.`date`)
WHERE p2.item_id IS NULL;

This query matches Volume to all rows in Price that are earlier, and then uses another join to make sure we find only the most recent price.

I tested this on MySQL 5.0.51. It uses neither correlated subqueries nor group by.

edit: Updated the query to match to item_id as well as date. This seems to work too. I created an index on (date) and an index on (date, item_id) and the EXPLAIN plan was identical. An index on (item_id, date) may be better in this case. Here's the EXPLAIN output for that:

+----+-------------+-------+------+---------------+---------+---------+-----------------+------+--------------------------------------+
| id | select_type | table | type | possible_keys | key     | key_len | ref             | rows | Extra                                |
+----+-------------+-------+------+---------------+---------+---------+-----------------+------+--------------------------------------+
|  1 | SIMPLE      | p1    | ALL  | item_id       | NULL    | NULL    | NULL            |    6 |                                      | 
|  1 | SIMPLE      | v     | ref  | item_id       | item_id | 22      | test.p1.item_id |    3 | Using where                          | 
|  1 | SIMPLE      | p2    | ref  | item_id       | item_id | 22      | test.v.item_id  |    1 | Using where; Using index; Not exists | 
+----+-------------+-------+------+---------------+---------+---------+-----------------+------+--------------------------------------+

But I have a very small data set, and the optimization may depend on larger data sets. You should experiment, analyzing the optimization using a larger data set.

edit: I pasted the wrong EXPLAIN output before. The one above is corrected, and shows better use of the (item_id, date) index.

Bill Karwin
Looks good so far, What if we assume there could be more than one possible item_id, just stick v.item_id=p1.item_id and v.item_id=p2.item_id in the on clauses, right? Also, if you're really feeling up to it, is it better to add indexes on (item_id, date) or (date, item_id)?
ʞɔıu