views:

105

answers:

6

Hi Guys,

The database I'm currently working on is for a scientific instrument that records data at certain points in time. My instrument records per second, but with a resolution of 5 seconds.

ID      Total Particles  DateandTime
38313   602              2009-01-27 16:25:48.000
38314   602              2009-01-27 16:25:49.000
38315   602              2009-01-27 16:25:50.000
38316   602              2009-01-27 16:25:51.000
38317   602              2009-01-27 16:25:52.000
38318   553              2009-01-27 16:25:53.000
38319   553              2009-01-27 16:25:54.000
38320   553              2009-01-27 16:25:55.000
38321   553              2009-01-27 16:25:56.000
38322   553              2009-01-27 16:25:57.000
38323   515              2009-01-27 16:25:58.000
38324   515              2009-01-27 16:25:59.000
38325   515              2009-01-27 16:26:00.000
38326   515              2009-01-27 16:26:01.000
38327   515              2009-01-27 16:26:02.000

This is useful for us in some situations, but isn't relevant for this part of the project. I'd like to collapse this data so that there was one result per 5 seconds (say the final second can hold all the data for convention).

I'd prefer a solution that didn't skip 4 rows and display the 5th as that may not work if my instument messed up and only gave me 4 rows in the db. What I'd like is some way of comparing the rows around each sample and condensing based on that.

I don't really know where to start with an operation like this so any help would be appreciated.

A: 

Probably the best way (if you want to keep all the 'raw' data) is to have some sort of trigger that inserts into a 'summary' table when the 'raw' table is inserted into. This trigger will discern the basis to pass the information up, depending on the time. It will mean you'll get a lot of data, but I expect that it is appropriate to keep it around. It's worth noting that doing it via a trigger could be bad (slow) because it will occur for each insert. It may be more appropriate to trap this information at the application level, but I don't know if that's possible ...

Noon Silk
+1  A: 

if the stream is fairly predictable, you could do something like:

select id, particles, time from log where
id in (select min(id) from log group by
datediff(second, (select min(time) from log), time) / 5)

would get the first reading from every 5 second interval. the first interval covers seconds 1 - 5 in the log. the next interval is seconds 6 - 10 etc. there is no overlap.

you can do a similar query to get the average over each interval, but since you expect readings in the same interval to be identical, an avg may blend the data in the case that there are missing or mistimed entries, so just taking the first or last reading is probably better.

jspcal
This did fairly well on my data, but not 100%. As you say, if the data was predictable this would work great; but my data isn't that forgiving unfortunately. Thanks.
Geodesic
+1  A: 

I left a comment - I'm going to assume that a physical data point still takes up 5 seconds even if you don't get 5 rows. If so, the following might work.

First, create a seconds function:

CREATE FUNCTION dbo.GetTotalSeconds(@dt datetime)
RETURNS bigint
AS
BEGIN
    RETURN DATEDIFF(ss, '1753-01-01', @dt)
END

Now this query:

DECLARE
    @StartID int,
    @EndID int

SET @StartID = 38313
SET @EndID = 40313

DECLARE @StartSeconds bigint

SELECT TOP 1 @StartSeconds = dbo.GetSeconds(DateandTime)
FROM DataPoints
WHERE ID >= @StartID
ORDER BY ID ASC

SELECT p.ID, p.Total, p.Particles, p.DateandTime
FROM
(
    SELECT DISTINCT
        (dbo.GetSeconds(DateandTime) - @StartSeconds) / 5 AS SecondsInterval
    FROM DataPoints
    WHERE ID >= @StartID
    AND ID <= @EndID
) g
INNER JOIN DataPoints p
ON (dbo.GetSeconds(p.DateandTime) - @StartSeconds) = g.SecondsInterval
ORDER BY g.SecondsInterval

Performance will be pretty weak but it should handle all the edge cases.

Aaronaught
This really helped me figure out the issue. Didn't seem to work out as I expected though - it seemed to find the 5 similar values after StartID then stop. I'm sure with a little more work this would have been the better solution though.
Geodesic
+2  A: 

Use the usual ROW_NUMBER() partitioning for duplicates, and use an appropriate partitoning scheme.

with cte as (
 select ID, Total, Particles, DateandTime,
   row_number() over (
    partition by datediff(seconds,'19000101',DateandTime)/5
    order by DateandTime) as rn
 from table)
select * from cte
where rn = 1; 

Can't comment on how efficient this is w/o knowing the exact schema (including indexes) of the tables.

Remus Rusanu
This answer was the closest to what I needed, but it kept had issues when my data wasn't exactly the way it should. Thanks for the help!
Geodesic
I forgot to divide by 5 to group all records in the same 5 seconds interval together. Updated.
Remus Rusanu
A: 

Thanks for all your help guys. I think if I had a bit more time on this issue I would have been able to use all your samples and make something work directly in SQL - but I don't really have that liberty. The project I'm working on is primarily written in MatLab, so what I've done is pulled the ID and TotalParticles data out of the DB, then in Matlab I ran checks on the TotalParticles Column. The output of this script are the ID values that I require for the collapsed database, which I'll feed in to a simple select statement and that in turn will make the view I require.

Geodesic
+1  A: 

OK, so I had some time to convert my MatLab solution to SQL. Not too sure if it's really that great, but it works quite well. My MatLab solution took over 30 minutes to compute a result for the 2.2 million rows I have, the SQL version took 4 minutes!

Here's the code for reference if anyone has a similar issue in the future:

DECLARE @i INT
DECLARE @Count INT
DECLARE @Selection INT
DECLARE @Result TABLE (ID INT)

SET @i = 1
SET @Count = (SELECT count(*) FROM CFLAPS_AllStations)

WHILE (@i <= @Count)
BEGIN
   IF @i < @Count - 4
    BEGIN
        SET @Selection = (SELECT TotalParticlesCount FROM CFLAPS_AllStations WHERE ID = @i)
        IF  @Selection = (SELECT TotalParticlesCount FROM CFLAPS_AllStations WHERE ID = @i+4)
        BEGIN   
            INSERT INTO @Result SELECT ID FROM CFLAPS_AllStations WHERE ID = @i
        END
        ELSE
        BEGIN
            INSERT INTO @Result SELECT ID FROM CFLAPS_AllStations WHERE ID = @i

            SET @i =  
            CASE 
                WHEN @Selection = (SELECT TotalParticlesCount FROM CFLAPS_AllStations WHERE ID = @i+3) THEN @i-1
                WHEN @Selection = (SELECT TotalParticlesCount FROM CFLAPS_AllStations WHERE ID = @i+2) THEN @i-2
                WHEN @Selection = (SELECT TotalParticlesCount FROM CFLAPS_AllStations WHERE ID = @i+1) THEN @i-3
                WHEN @Selection = (SELECT TotalParticlesCount FROM CFLAPS_AllStations WHERE ID = @i) THEN @i-4
            END
        END
   END
   ELSE
   BEGIN
      INSERT INTO @Result SELECT ID FROM CFLAPS_AllStations WHERE ID = @i
   END
   SET @i = @i+5
END

SELECT p.ID, p.TotalParticlesCount, p.FullDateTime FROM CFLAPS_AllStations as p, @Result as q WHERE p.ID = q.ID

Just so you all know why this was important: The instrument would always log 5 seconds of identical data, but if it stopped recording at say, second 3 of the cycle; it would not record the 4th and 5th seconds and therefore gaps in the data appeared. I managed to check for that with this while loop and managed to capture all these inconsistancies.

Final Output:

ID  TotalParticles  DateTime
1       745     2009-06-23 00:00:00.000
6       727     2009-06-23 00:00:05.000
11      771     2009-06-23 00:00:10.000
16      837     2009-06-23 00:00:15.000
21      768     2009-06-23 00:00:20.000
26      703     2009-06-23 00:00:25.000
31      822     2009-06-23 00:00:30.000
36      730     2009-06-23 00:00:35.000
41      731     2009-06-23 00:00:40.000
46      706     2009-06-23 00:00:45.000
51      733     2009-06-23 00:00:50.000
...
2290089 677     2009-06-22 23:59:15.000
2290094 720     2009-06-22 23:59:20.000
2290099 771     2009-06-22 23:59:25.000
2290104 770     2009-06-22 23:59:30.000
2290109 761     2009-06-22 23:59:35.000
2290114 851     2009-06-22 23:59:40.000
2290119 801     2009-06-22 23:59:45.000
2290124 754     2009-06-22 23:59:50.000
2290129 702     2009-06-22 23:59:55.000

Thank you to all that helped out.

Geodesic