tags:

views:

516

answers:

7

My problem:

I got a MySQL database that stores a great amount of meteorological data in chronological order (New data are inserted every 10 min). Unfortunately there have been several blackouts and hence certain rows are missing. I recently managed to obtain certain backup-files from the weather station and now I want to use these to fill in the missing data.

The DB ist structures like this (example):

date*            the data    
2/10/2009 10:00  ...
2/10/2009 10:10  ...
( Missing data!)
2/10/2009 10:40  ...
2/10/2009 10:50  ...
2/10/2009 11:00  ...
...

*=datatime-type, primary key

My idea:

Since backup and database are located on different computers and traffic is quite slow, I thought of creating a MySQL-query that, when run, will return a list of all missing dates in a specified range of time. I could then extract these dates from the backup and insert them to the database.

The question:

How to write such a query? I don't have the permission to create any auxilary table. Is it possible to formulate a "virtual table" of all required dates in the specified interval and then use it in a JOIN? Or are there entirely different propositions for solving my problem?

Edit: Yes, the timestamps are consistently in the form shown above (always 10 minutes), except that some are just missing.

Okay, what about the temporary tables? Is there an elegant way of populating them with the time-range automatically? What if two scripts try to run simultaneously, does this cause problems with the table?

A: 

Create a temporary table (JOIN). Or take all the dates and query them locally, where you should have free reign (loop/hash).

For the JOIN, your generated reference of all dates is your base table and your data is your joined table. Seek out pairs where the joined data does not exist and select the generated date.

Mark Canlas
+1  A: 

If you can create a temporary table, you can solve the problem with a JOIN

CREATE TEMPORARY TABLE DateRange
(theDate DATE);

Populate the table with all 10 minute intervals between your dates, then use the following

SELECT theDate
FROM DateRange dr
LEFT JOIN Meteorological mm on mm.date = dr.theDate
WHERE mm.date IS NULL

The result will be all of the date/times that do not have entries in your weather table.

If you need to quickly find days with missing data, you can use

select Date(mm.Date),144-count(*) as TotMissing 
from Meteorological mm 
group by Date(mm.Date) 
having count(*) < 144

This is assume 24 hour a day, 6 entries per hour (hence 144 rows). – Sparky 0 secs ago

Sparky
Yes, I already tried your last solution, but - of course - it doesn'T find complete *days* or even months missing.
Dario
Didn't realize you could have complete days missing as well. In that case, you would need some sort of date table to compare against. Good luck...
Sparky
Or I use this approach recursively ... but that's perhaps not that elegant.Thanks anyway
Dario
A: 

Do a self join and then calculate the max values that are smaller and have a difference larger than your interval.

In Oracle I'd do it like this (with ts being the timestamp column):

Select t1.ts, max(t2.ts)
FROM atable t1 join atable t2 on t1.ts > t2.ts
GROUP BY t1.ts
HAVING (t1.ts - max(t2.ts))*24*60 > 10

There will be better ways to handle the difference calculation in mySql, but I hope the idea comes across.

This query will give you the timestamps directly after and before outage, and you can build from there.

IronGoofy
Aargh, I really liked this query (HAVING clauses are often neglected). Unfortunately, tests in SQL Server are running very slowly. The query plan shows the join being fully resolved before the aggregation is done. The join returns (n)+(n-1)+(n-2)+(n-3)+... rows; I.e. O(n^2) . -1 Sorry.
Craig Young
A: 

As a quick solotion using Sql Server, check for dates that do not have a follower of date+interval. I think MySql does have some sort of dateadd function, but you can try something like this. This will show you the ranges where you have missing data.

DECLARE @TABLE TABLE(
     DateValue DATETIME
)

INSERT INTO @TABLE SELECT '10 Feb 2009 10:00:00'
INSERT INTO @TABLE SELECT '10 Feb 2009 10:10:00'
INSERT INTO @TABLE SELECT '10 Feb 2009 10:40:00'
INSERT INTO @TABLE SELECT '10 Feb 2009 10:50:00'
INSERT INTO @TABLE SELECT '10 Feb 2009 11:00:00'

SELECT  *
FROM    @TABLE currentVal
WHERE   ((SELECT * FROM @TABLE nextVal WHERE DATEADD(mi,10,currentVal.DateValue) = nextVal.DateValue) IS NULL AND currentVal.DateValue != (SELECT MAX(DateValue) FROM @TABLE))
OR   ((SELECT * FROM @TABLE prevVal WHERE DATEADD(mi,-10,currentVal.DateValue) = prevVal.DateValue) IS NULL  AND currentVal.DateValue != (SELECT MIN(DateValue) FROM @TABLE))
astander
Nice idea, but it might be a little too slow if there is lots of data. I think it would be faster if you used joins instead of inner selects. Having said that, the best way to find out is to try it.
Mark Byers
I haven't tested this, but I have my doubts. I assume the intention is that: the odd numbered rows in the result-set will be the (exclusive) starts of gaps ranges, and the even numbered rows the (exclusive) ends of the gap ranges? If so, then the data has hardly been simplified, you still need to iterate your data **in sorted order** to fill in the gaps. In which case, you might as well have done so with `SELECT DateValue FROM @TABLE ORDER BY 1` in the first place.
Craig Young
This checks for rows where there is not *next* item or *previous* item given the time inerval, and ignores the boundry items, thus the *min* and *max* queries.
astander
I said I had my doubts... it's tested and confirmed! As explained by astander above: Rows are included in the result-set if there is _not a next item_ **or** if there is _not a previous item_ ... therein lies the problem! We cannot tell which condition resulted in the inclusion; more importantly what if a row matches both conditions?! (This sample fails (# for missing rows, bold for results): 10:00 **10:10** # **10:30** # **10:50** 11:00 **11:10** # # **11:40** ). Furthermore, I stand by my other concern: Structurally, the result-set is no more useful than the source data.
Craig Young
A: 

Note: uses MSSQL syntax. I think MySQL uses DATE_ADD(T1.date, INTERVAL 10 MINUTE) instead of DATEADD, but I haven't tested this.

You can get the missing timestamps with two self-joins:

SELECT T1.[date] AS DateFrom, MIN(T3.[date]) AS DateTo
    FROM [test].[dbo].[WeatherData] T1
    LEFT JOIN [test].[dbo].[WeatherData] T2 ON DATEADD(MINUTE, 10, T1.date) = T2.date
    LEFT JOIN [test].[dbo].[WeatherData] T3 ON T3.date > T1.Date
    WHERE T2.[value] IS NULL
    GROUP BY T1.[date]

If you have a lot of data, You might want to try restricting the range to one month at a time to avoid heavy load on your server, as this operation could be quite intensive.

The results will be something like this:

DateFrom                    DateTo
2009-10-02 10:10:00.000 2009-10-02 10:40:00.000
2009-10-02 11:00:00.000 NULL

The last row represents all data from the last timestamp into the future.

You can then use another join to get the rows from the other database that have a timestamp in between any of these intervals.

Mark Byers
+3  A: 
select t1.ts as hival, t2.ts as loval
from metdata t1, metdata t2
where t2.ts = (select max(ts) from metdata t3
where t3.ts < t1.ts)
and not timediff(t1.ts, t2.ts) = '00:10:00'

This query will return couplets you can use to select the missing data. The missing data will have a timestamp between hival and loval for each couplet returned by the query.

EDIT - thx for checking, Craig

EDIT2 :

getting the missing timestamps - this SQL gets a bit harder to read, so I'll break it up a bit. First, we need a way to calculate a series of timestamp values between a given low value and a high value in 10 minute intervals. A way of doing this when you can't create tables is based on the following sql, which creates as a resultset all of the digits from 0 to 9.

select d1.* from 
(select 1 as digit
union select 2 
union select 3 
union select 4 
union select 5 
union select 6 
union select 7 
union select 8 
union select 9 
union select 0 
) as d1

...now by combining this table with a copy of itself a couple of times means we can dynamically generate a list of a specified length

select curdate() + 
INTERVAL  (d1.digit * 100 + d2.digit * 10 + d3.digit) * 10 MINUTE 
as date 
from (select 1 as digit
union select 2 
union select 3 
union select 4 
union select 5 
union select 6 
union select 7 
union select 8 
union select 9 
union select 0 
) as d1
join
(select 1 as digit
union select 2 
union select 3 
union select 4 
union select 5 
union select 6 
union select 7 
union select 8 
union select 9 
union select 0 
) as d2
join
(select 1 as digit
union select 2 
union select 3 
union select 4 
union select 5 
union select 6 
union select 7 
union select 8 
union select 9 
union select 0 
) as d3
where (d1.digit * 100 + d2.digit * 10 + d3.digit) between 1 and 42
order by 1

... now this piece of sql is getting close to what we need. It has 2 input variables:

  1. a starting timestamp (I used curdate() in the example); and a
  2. number of iterations - the where clause specifies 42 iterations in the example, maximum with 3 x digit tables is 1000 intervals

... which means we can use the original sql to drive the example from above to generate a series of timestamps for each hival lowval pair. Bear with me, this sql is a bit long now...

select daterange.loval + INTERVAL  (d1.digit * 100 + d2.digit * 10 + d3.digit) * 10 MINUTE as date 
from 
(select t1.ts as hival, t2.ts as loval
from metdata t1, metdata t2
where t2.ts = (select max(ts) from metdata t3
where t3.ts < t1.ts)
and not timediff(t1.ts, t2.ts) = '00:10:00'
) as daterange
join
(select 1 as digit
union select 2 
union select 3 
union select 4 
union select 5 
union select 6 
union select 7 
union select 8 
union select 9 
union select 0 
) as d1
join
(select 1 as digit
union select 2 
union select 3 
union select 4 
union select 5 
union select 6 
union select 7 
union select 8 
union select 9 
union select 0 
) as d2
join
(select 1 as digit
union select 2 
union select 3 
union select 4 
union select 5 
union select 6 
union select 7 
union select 8 
union select 9 
union select 0 
) as d3
where (d1.digit * 100 + d2.digit * 10 + d3.digit) between 1 and
 round((time_to_sec(timediff(hival, loval))-600) /600)
order by 1

...now there's a bit of epic sql
NOTE : using the digits table 3 times gives a maximum gap it will cover of a bit over 6 days

Steve De Caux
I'm sorry but I'm too set in my ways to voluntarily use "left join" syntax !
Steve De Caux
Nice idea - Seems to have a suboptimal computational complexity, but very elegant nontheless.
Dario
Steve De Caux
This is a **very** inefficient query. It is performing a cross-join between t1 and t2, and the filtering of rows from the join requires a computation on every permutation of the cross-join. This solution will not scale well with many rows. @Steve: Perhaps your Oracle tables tended to have more rows in them?
Craig Young
Hate to disagree with you Craig, but I haven't got an explain handy to show you why. The query works of off the PK on table metdata in 1 indexscan. Last time I ran it was on a table with 1.6 M rows, query returned in under a second (on a prod server during business hours)
Steve De Caux
Changed -1 to +1 @Steve: Well, you certainly piqued my curiosity, so I decided to put it to the test (MS SQL Server though); and I admit it's much faster than I expected. The query plan showed that the optimser had effectively adjusted the condition to `(t1.ts - 10min) = t2.ts`. This allowed a hash join, and perfectly acceptable performance for this problem. That DB2 could resolve this query to a single index scan is very impressive!! I did my tests with 20 years sample data: it took between 4 and 14 seconds depending on the percentage of gaps. (My solution was a consistent 3-4 seconds).
Craig Young
Bah - won't let me change my vote unless answer is edited.
Craig Young
Duly edited, thx for checking this SQL Craig on MSSQL.Note on the single indexscan : this is not exactly what the Oracle plan shows but reading between the lines it is always using the index data that it already has in hte buffers from the indexspacescan to join back to itself. Note that the results I quoted are from Oracle 9i. When the query is not based on the PK, then Oracle needs a little coaching to get it right.
Steve De Caux
@Craig - I'd be interested to know if this solution works on mssql as well - only if you have the inclination !
Steve De Caux
A: 

This solution uses sub-queries, and there is no need for any explicit temporary tables. I've assumed your backup data is in another database on the other machine; if not you'd only need to do up to step 2 for the result-set you need, and write your program to update the main database accordingly.

The idea is to start out by producing a 'compact' result-set summarising the gap-list. I.e. the following data:

MeasureDate
2009-12-06 13:00:00
2009-12-06 13:10:00
--missing data
2009-12-06 13:30:00
--missing data
2009-12-06 14:10:00
2009-12-06 14:20:00
2009-12-06 14:30:00
--missing data
2009-12-06 15:00:00

Would be transformed into the following where actual gaps are strictly between (i.e. exclusive of) the endpoints:

GapStart            GapEnd
2009-12-06 13:10:00 2009-12-06 13:30:00
2009-12-06 13:30:00 2009-12-06 14:10:00
2009-12-06 14:30:00 2009-12-06 15:00:00
2009-12-06 15:00:00 NULL

The solution query is built up as follows:

  1. Obtain all MeasureDates that don't have an entry 10 minutes later as this will be the start of a gap. NOTE: The last entry will be included even though not strictly a gap; but this won't have any adverse effects.
  2. Augment the above by adding the end of the gap using the first MeasureDate after the start of the gap.
  3. NOTE: The gap-list is compact, and unless you have an exceptionally high prevalence of fragmented gaps, it should not consume much bandwidth in passing that result-set to the backup machine.
  4. Use an INNER JOIN with inequalities to identify any missing data that may be available in the backup. (Run tests and checks to verify the integrity of your backup data.)
  5. Assuming your backup data is sound, and won't produce anomalous unfounded spikes in your measurements, INSERT the data in your main database.

The following query should be tested (preferably adjusted to run on the backup server for performance reasons).

/* TiC Copyright
This query is writtend (sic) by me, and cannot be used without 
expressed (sic) written permission. (lol) */

/*Step 3*/
SELECT  gap.GapStart, gap.GapEnd,
        rem.MeasureDate, rem.Col1, ...
FROM    (
        /*Step 2*/
        SELECT  gs.GapStart, (
                SELECT  MIN(wd.MeasureDate)
                FROM    WeatherData wd
                WHERE   wd.MeasureDate > gs.GapStart
                ) AS GapEnd
        FROM    (
                /*Step 1*/
                SELECT  wd.MeasureDate AS GapStart
                FROM    WeatherData wd
                WHERE   NOT EXISTS (
                        SELECT  *
                        FROM    WeatherData nxt
                        WHERE   nxt.MeasureDate = DATEADD(mi, 10, wd.MeasureDate)
                        )
                ) gs
        ) gap
        INNER JOIN RemoteWeatherData rem ON
            rem.MeasureDate > gap.GapStart
        AND rem.MeasureDate < gap.GapEnd

The insert...

INSERT INTO WeatherData (MeasureDate, Col1, ...)
SELECT  /*gap.GapStart, gap.GapEnd,*/
        rem.MeasureDate, rem.Col1, ...
...
Craig Young