views:

304

answers:

8

So here's yet another 'write a query to X' challenge.

I'm monitoring a number of networked vending machines. Each machine has a number of parts, e.g. bank note acceptor, coin system, printer and so on.

Problems with machine parts are logged in table, let's call it 'faults', which looks something like this (irrelevant fields omitted):

machineid           partid         start_time            end_time
---------           ------         ----------------      ----------------
       1                2          2009-10-05 09:00      NULL
       1                3          2009-10-05 08:00      2009-10-05 10:00
       2                2          2009-09-30 12:00      2009-09-30 14:00
       3                4          2009-09-28 13:00      2009-09-28 15:00
       3                2          2009-09-28 12:00      2009-09-28 14:00

end_date is NULL if the problem is currently ongoing.

I need a query which show time periods for which the machine as a whole is down, and which can account for overlapping ranges, collapsing them down into a single record. So for the sample data above, it would produce:

machineid          start_time            end_time
---------          ----------------      ----------------
       1           2009-10-05 08:00      NULL
       2           2009-09-30 12:00      2009-09-30 14:00
       3           2009-09-28 12:00      2009-09-28 15:00

It's not tough to write procedural code to do this line by line, but a nice declarative SQL query would be more useful, more elegant. It seems like it ought to be possible, I just can't quite get there though.

SQL dialect is Oracle. Analytic functions are availabe if that would help.

Thanks!

A: 
SELECT machineid, min(start_time), max(ifnull(end_time, '3000-01-01 00:00'))
FROM faults
GROUP BY machineid

should do the job (replacing ifnull by the equivalent Oracle function if needed).

RC
No, it does not if times are not overlapping.
Lukasz Lysik
not if you have the same machine that went down at 2 different times like 3:00 - 5:00 and then 6:00 - 7:00, you would say it was down from 3:00 - 7:00 which is not true.
CSharpAtl
This query would possibly imply incorrect things. It will return a downtime period that goes from the earliest start_time to the latest end_time for all records for a given machineid. This time could include uptimes as well as downtimes!
sheepsimulator
I forgot this case, yes.
RC
A: 

I wish I had time to give a full answer, but here is a hint to find overlapping downtimes:

select a.machineid, a.start_time, a.end_time, b.start_time, b.end_time
from faults a,
     faults b,
where a.machineid = b.machineid
  and b.start_time >= a.start_time
  and b.start_time <= a.end_time;
sheepsimulator
This will only find 2 breakages caused by 2 parts. Not N
DVK
@DVK - True. As I said in the above answer, I didn't solve the problem completely. But it is part of the solution.
sheepsimulator
+2  A: 
SELECT  DISTINCT 
        t1.machineId, 
        MIN(t2.start_time) start_time, 
        MAX(COALESCE(t2.end_time, '3210/01/01')) end_time
FROM FAULTS t1
JOIN FAULTS t2 ON t1.machineId = t2.machineId
                  AND ((t2.start_time >= t1.start_time
                       AND (t1.end_time IS NULL OR t2.start_time <= t1.end_time)
               )
               OR
               (t1.start_time >= t2.start_time 
                       AND (t2.end_time IS NULL OR t1.start_time <= t2.end_time) 
               ))
GROUP BY t1.machineId, t1.part_id

I tested this query on the following data:

machine_id   |part_id |start_time     |end_time
-------------------------------------------------------------------------
1     |2   |05 Oct 2009 09:00:00 |NULL
1     |3   |05 Oct 2009 08:00:00 |05 Oct 2009 10:00:00
2     |2   |30 Sep 2009 12:00:00 |30 Sep 2009 14:00:00
2     |3   |30 Sep 2009 15:00:00 |30 Sep 2009 16:00:00
2     |4   |30 Sep 2009 16:00:00 |30 Sep 2009 17:00:00
3     |2   |28 Sep 2009 12:00:00 |28 Sep 2009 14:00:00
3     |4   |28 Sep 2009 13:00:00 |28 Sep 2009 15:00:00

I got this:

machine_id   |start_time          |end_time
-----------------------------------------------------------------
1     |05 Oct 2009 08:00:00 |01 Jan 3210 00:00:00
2     |30 Sep 2009 12:00:00 |30 Sep 2009 14:00:00
2     |30 Sep 2009 15:00:00 |30 Sep 2009 17:00:00
3     |28 Sep 2009 12:00:00 |28 Sep 2009 15:00:00
najmeddine
Won't this brak if there were 3 consecutive faults? of or am I mis-reading it?
DVK
I think you're mis-reading, or can you give me an example?
najmeddine
Please add |2|5|30 Sep 2009 17:00:00|30 Sep 2009 18:00:00| row and re-run your query. <P> If it produces 4 rows as above (with 3rd row ending at 18:00), your query works 100%. If 5 rows, I am not mis-reading and it does not work
DVK
@najmeddine - any luck? It would be so flipping cool if you aactually can do it in 1 query!
DVK
Indeed, with the row added the above doesn't do the job. Nice try though!
Res Cogitans
right now it's not working :|, i'll keep you informed...
najmeddine
+2  A: 

Basically, you can not do it (find a covering partition set of a forest) in pure set theory (e.g. as a bounded # of queries without a loop).

To do it in the most set-like way,

  1. Create a temp table for forest partitioning (10 or 11 columns, 4 from failure #1, 4 from failure #2, 1 for partition ID, 1 for round in which node was inserted, and 1 for assorted optimizations I can't think of with a fever of 38C.

  2. Run a loop (BFS or DFS, whatever you find to easier implement the forest partitioning algorithm in). The tricky part, compared to graphs, is that you can have many sub-trees joined from the top to current sub-tree

    You can use sheepsimulator's query as basic building block for the loop (e.g. finding 2 connected node)

  3. When the partitioning loop is done, simply do

   select min(p1.start_time), max(p2.end_time), p1.partition,p2.partition
   from partitions p1, partitions p2
   where p1.partition = p2.partition
   group by p1.partition,p2.partition


    /* This will need to be tweaked using COALESCE 
       to deal with NULL end times in obvious way) */

I apologize for not spelling the exact code for forest partitioning (it may be filed under tree partitioning) - I'm dead tired and I'm certain some Googling will yield one now that you know the tdata structure and problem name (or you can post this as a more precisely formulated Q on StackOverflow - e.g. "How to implement an algorithm for complete partitioning of a forest of trees as a loop in SQL".

DVK
P.S. I have a feeling you can optimize this to run in O(logN) and avoid the mins/maxs comuttation to boot by computing transitive edges in loop, but not 100% certain it will work
DVK
A: 

I believe you would need a stored proc to do this, or something like recursive 'Common Table Expressions (CTEs) (as exists in SQL srever), or otherwise (in a single SQL Statement) you would not be able to get the right answer when 3 or more rows togeher form a contiguous range of covered dates.

like:

 |----------|
           |---------------|
                        |----------------|

Without actually going through the exercise, I might suggest that in the stored proc, build a table of all "candidate dates" and then construct a table that contains all the dates that are NOT covered by a daterange in an existing row, then build your output resultset by "negating" this set.

Charles Bretana
A: 

Hehe.

In SIRA_PRISE, which supports interval types, solving this problem would be as easy as

SELECT machineID, period FROM Faults.

IN which 'period' is an attribute of the time interval type whose start and end points are start_time and end_time of your SQL table.

But since you are presumably forced to solve this in SQL, and with a system that does not support interval types, I can only wish you a lot of courage.

Two hints :

The union of two intervals can be handled in SQL using complex CASE constructs (if interval_values_overlap then lowest_start_time highest_end_time, all that sort of stuff).

Since you cannot tell beforehand how many rows will merge into one, you will presumably find yourself forced to write recursive SQL.

Erwin Smout
+4  A: 

Hi Res,

using analytics, you can build a query that will make a single pass on the data (with a large data set this will be the most efficient):

SELECT machineid, MIN(start_time), MAX(end_time)
  FROM (SELECT machineid, start_time, end_time, 
               SUM(gap) over(PARTITION BY machineid 
                             ORDER BY start_time) contiguous_faults
           FROM (SELECT machineid, start_time, 
                        coalesce(end_time, DATE '9999-12-31') end_time,
                         CASE
                            WHEN start_time > MAX(coalesce(end_time, 
                                                           DATE '9999-12-31'))
                                              over(PARTITION BY machineid 
                                                   ORDER BY start_time 
                                                   ROWS BETWEEN UNBOUNDED PRECEDING
                                                            AND 1 preceding)
                            THEN 1
                         END gap
                    FROM faults))
 GROUP BY machineid, contiguous_faults
 ORDER BY 1, 2

This query starts by determining if a row is contiguous to any row that started before. We then group the rows that are contiguous.

Vincent Malgrat
Genius! Took me a while a grok it, but my admiration grew the whole time. Very elegant!Thanks.
Res Cogitans
Looks like it may work - did you test it on the data set that I suggested to najmeddine? (his test set + 1 more row)
DVK
What does "over()" do? (I'm a Sybase person, not Oracle :( )
DVK
If someone could write the same thing as Sybase syntax I'd be incredibly greatful (I sure hope this doen't use.
DVK
@DVK: I tested the query on the data set as you suggested and the additional row is correctly merged into the last segment. The OVER keyword transforms a GROUP function into a windowing function (the function is applied on the window defined after the OVER keyword). Analytics are really powerful, alas I'm not a Sybase person and I wouldn't know if something equivalent can be written in Sybase :>
Vincent Malgrat
@Res: glad you grok it =)
Vincent Malgrat
@DVK: over() isn't Oracle-specific, but rather ISO SQL:2003 specific, so it's very possible Sybase supports the over() syntax. Your DBMS has to support SQL windowing functions to have it.
sheepsimulator
+1 kudos to vincent :)
sindre j