views:

88

answers:

3

I've looked all over and haven't yet found an intelligent way to handle this, though I feel sure one is possible:

One table of historical data has quarterly information:

CREATE TABLE Quarterly (
unique_ID INT UNSIGNED NOT NULL,
date_posted DATE NOT NULL,
datasource TINYINT UNSIGNED NOT NULL,
data FLOAT NOT NULL,
PRIMARY KEY (unique_ID));

Another table of historical data (which is very large) contains daily information:

CREATE TABLE Daily (
unique_ID INT UNSIGNED NOT NULL,
date_posted DATE NOT NULL,
datasource TINYINT UNSIGNED NOT NULL,
data FLOAT NOT NULL,
qtr_ID INT UNSIGNED,
PRIMARY KEY (unique_ID));

The qtr_ID field is not part of the feed of daily data that populated the database - instead, I need to retroactively populate the qtr_ID field in the Daily table with the Quarterly.unique_ID row ID, using what would have been the most recent quarterly data on that Daily.date_posted for that data source.

For example, if the quarterly data is

101 2009-03-31 1 4.5
102 2009-06-30 1 4.4
103 2009-03-31 2 7.6
104 2009-06-30 2 7.7
105 2009-09-30 1 4.7

and the daily data is

1001 2009-07-14 1 3.5 ??
1002 2009-07-15 1 3.4 &&
1003 2009-07-14 2 2.3 ^^

then we would want the ?? qtr_ID field to be assigned '102' as the most recent quarter for that data source on that date, and && would also be '102', and ^^ would be '104'.

The challenges include that both tables (particularly the daily table) are actually very large, they can't be normalized to get rid of the repetitive dates or otherwise optimized, and for certain daily entries there is no preceding quarterly entry.

I have tried a variety of joins, using datediff (where the challenge is finding the minimum value of datediff greater than zero), and other attempts but nothing is working for me - usually my syntax is breaking somewhere. Any ideas welcome - I'll execute any basic ideas or concepts and report back.

+1  A: 

Just subquery for the quarter id using something like:

(
 SELECT unique_ID 
 FROM Quarterly 
 WHERE 
     datasource = ? 
     AND date_posted >= ? 
 ORDER BY
     unique_ID ASC
 LIMIT 1
)

Of course, this probably won't give you the best performance, and it assumes that dates are added to Quarterly sequentially (otherwise order by date_posted). However, it should solve your problem.

You would use this subquery on your INSERT or UPDATE statements as the value of your qtr_ID field for your Daily table.

Kenaniah
Unfortunately, that doesn't exactly do the job -- it does pull out the correct datasource, but it ranks based on the time passed since (and then provides the first entry of) the earliest Quarterly data, not what would have been the most recent. So, in the example above, it returns '101' for ALL data from datasource = '1', and '103' for ALL data from datasource = '2'. Here's what I ran:UPDATE Daily SET qtr_ID = (SELECT unique_IDFROM QuarterlyWHERE Daily.datasource = Quarterly.datasource AND Daily.date_posted >= Quarterly.date_postedORDER BY date_posted ASCLIMIT 1)
Hank
By the way, this suggestion is MUCH faster than my successful query below.
Hank
The big difference in speed kept me working on the suggestion above, which led to the discovery that it was just one small change to be made (sorting by date_posted in the other direction) so that this is the code that works, and is fast: UPDATE Daily SET qtr_ID = ( SELECT unique_ID FROM Quarterly WHERE Daily.datasource = Quarterly.datasource AND Daily.date_posted >= Quarterly.date_posted ORDER BY date_posted DESC LIMIT 1 ); Thanks a bunch Kenaniah
Hank
Hank, the reason it's faster probably has to do with the fact that unique_ID is an indexed field. This is assuming that date_posted is not.
Kenaniah
A: 

The following appears to work exactly as intended but it surely is ugly (with three calls to the same DATEDIFF!!), perhaps by seeing a working query someone might be able to further reduce it or improve it:

UPDATE Daily SET qtr_ID = (select unique_ID from Quarterly
WHERE Quarterly.datasource = Daily.datasource AND
DATEDIFF(Daily.date_posted, Quarterly.date_posted) = 
(SELECT MIN(DATEDIFF(Daily.date_posted, Quarterly.date_posted)) from Quarterly
WHERE Quarterly.datasource = Daily.datasource AND
DATEDIFF(Daily.date_posted, Quarterly.date_posted) > 0));
Hank
See the above response for a MUCH faster solution than this one.
Hank
A: 

After more work on this query, I ended up with enormous performance improvements over the original concept. The most important improvement was to create indices in both the Daily and Quarterly tables - in Daily I created indices on (datasource, date_posted) and (date_posted, datasource) USING BTREE and on (datasource) USING HASH, and in Quarterly I did the same thing. This is overkill but it made sure I had an option that the query engine could use. That reduced the query time to less than 1% of what it had been. (!!)

Then, I learned that given my particular circumstances I could use MAX() instead of ORDER BY and LIMIT so I use a call to MAX() to get the appropriate unique_ID. That reduced the query time by about 20%.

Finally, I learned that with the InnoDB storage engine I could segment the chunk of the Daily table that I was updating with any one query, which allowed me to multi-thread the queries with a little elbow-grease and scripting. The parallel processing worked well and every thread reduced the query time linearly.

So, the basic query that is performing literally 1000 times better than my own first attempt is:

UPDATE Daily
SET qtr_ID =
(
  SELECT MAX(unique_ID)
  FROM Quarterly
  WHERE Daily.datasource = Quarterly.datasource AND
        Daily.date_posted > Quarterly.dateposted
)
WHERE unique_ID > ScriptVarLowerBound AND
      unique_ID <= ScriptVarHigherBound
;
Hank