We are building a MySQL data warehousing application that stores descriptive data (the User ID, Work ID, Machine ID, Event Start Time, and Event End Time columns in the first table below) together with production quantity and duration data (the Output and Time columns), to which aggregate functions (SUM, COUNT, AVG) are applied. We now wish to disaggregate the time data for another type of analysis.

Our current data table design:

+---------+---------+------------+---------------------+---------------------+--------+------+
| User ID | Work ID | Machine ID | Event Start Time    | Event End Time      | Output | Time |
+---------+---------+------------+---------------------+---------------------+--------+------+
| 080025  | ABC123  | M01        | 2008-01-24 16:19:15 | 2008-01-24 16:34:45 |   2120 |  930 | 
+---------+---------+------------+---------------------+---------------------+--------+------+  

The disaggregation we would like to do is to reprocess the table contents at a granularity of minutes, rather than the current per-production-event granularity (one row per Event Start Time / Event End Time pair). Reprocessing the existing row above would produce:

+---------+---------+------------+---------------------+--------+
| User ID | Work ID | Machine ID | Production Minute   | Output |
+---------+---------+------------+---------------------+--------+
| 080025  | ABC123  | M01        | 2008-01-24 16:19    |    133 |
| 080025  | ABC123  | M01        | 2008-01-24 16:20    |    133 |
| 080025  | ABC123  | M01        | 2008-01-24 16:21    |    133 |
| 080025  | ABC123  | M01        | 2008-01-24 16:22    |    133 |
| 080025  | ABC123  | M01        | 2008-01-24 16:23    |    133 |
| 080025  | ABC123  | M01        | 2008-01-24 16:24    |    133 |
| 080025  | ABC123  | M01        | 2008-01-24 16:25    |    133 |
| 080025  | ABC123  | M01        | 2008-01-24 16:26    |    133 |
| 080025  | ABC123  | M01        | 2008-01-24 16:27    |    133 |
| 080025  | ABC123  | M01        | 2008-01-24 16:28    |    133 |
| 080025  | ABC123  | M01        | 2008-01-24 16:29    |    133 |
| 080025  | ABC123  | M01        | 2008-01-24 16:30    |    133 |
| 080025  | ABC123  | M01        | 2008-01-24 16:31    |    133 |
| 080025  | ABC123  | M01        | 2008-01-24 16:32    |    133 |
| 080025  | ABC123  | M01        | 2008-01-24 16:33    |    133 |
| 080025  | ABC123  | M01        | 2008-01-24 16:34    |    133 |
+---------+---------+------------+---------------------+--------+

So the reprocessing would take an existing row created at production-event granularity and re-express it at minute granularity, dropping the now-redundant Event End Time and Time columns in the process. It assumes a constant rate of production and divides Output by the difference in minutes plus one to populate the new table's Output column.
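
As a quick check of that rule against the sample row (the values below are taken straight from the example; this is just the arithmetic, not part of the final query):

# Event Start 16:19, Event End 16:34, Output 2120.
# Clock minutes covered = (34 - 19) + 1 = 16, so per-minute output = 2120 / 16 = 132.5,
# which the target table shows rounded up to 133.
select round( 2120 / (timestampdiff(minute, '2008-01-24 16:19:00', '2008-01-24 16:34:00') + 1) ) as per_minute_output;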

I know this can be done in application code, but can it be done entirely in a MySQL INSERT statement (or otherwise entirely within MySQL)? I am thinking of an INSERT INTO ... SELECT construction but keep getting stuck. An additional complexity is that there are hundreds of machines to include in the operation, so there will be multiple rows (one per machine) for each minute of the day.

Any ideas would be much appreciated. Thanks.

+2  A: 

You can create a table containing a row for every minute from the start of your dataset to the end, and run joins against that:

select p.user_id, p.work_id, p.machine_id, m.production_minute, p.output
from prod_event p
join prod_minute m
  on p.start <= m.production_minute
 and m.production_minute <= p.end;
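
One detail worth checking with that join: the rows in prod_minute land exactly on :00 seconds, so an event starting at 16:19:15 would not match its own first minute under p.start <= m.production_minute. Truncating the start to the minute in the join condition avoids that, and the same query can also spread Output across the minutes covered, as the question asks. A sketch, assuming the prod_event column names used above (start, end, output):

select p.user_id, p.work_id, p.machine_id, m.production_minute,
       p.output / (timestampdiff(minute,
                                 date_format(p.start, '%Y-%m-%d %H:%i:00'),
                                 date_format(p.end,   '%Y-%m-%d %H:%i:00')) + 1) as output
from prod_event p
join prod_minute m
  on m.production_minute >= date_format(p.start, '%Y-%m-%d %H:%i:00')   # start truncated to the minute
 and m.production_minute <= p.end;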

Populating the prod_minute table can be fun:

create table counter ( i int not null auto_increment primary key );
# seed row: with the default SQL mode, inserting 0 into an auto_increment column yields i = 1
insert into counter values ( 0 );
# each repetition doubles the row count: 1, 2, 4, 8, ...
insert into counter select NULL from counter;
# ... repeat until your counter table contains enough minutes
# (about 16 repetitions gives 65,536 rows, roughly 45 days of minutes)

create table prod_minute ( production_minute datetime not null primary key );
insert into prod_minute select date_add( '2000-01-01', interval i minute ) from counter;
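
Once prod_minute is filled, the whole reprocessing asked about above can be run as a single INSERT ... SELECT. The target table name and column types below are assumptions for illustration; the SELECT is the per-minute variant sketched earlier:

# hypothetical target table for the per-minute rows
create table prod_output_by_minute (
    user_id            varchar(10)   not null,
    work_id            varchar(10)   not null,
    machine_id         varchar(10)   not null,
    production_minute  datetime      not null,
    output             decimal(10,2) not null,
    primary key (machine_id, production_minute, user_id, work_id)
);

insert into prod_output_by_minute
    (user_id, work_id, machine_id, production_minute, output)
select p.user_id, p.work_id, p.machine_id, m.production_minute,
       p.output / (timestampdiff(minute,
                                 date_format(p.start, '%Y-%m-%d %H:%i:00'),
                                 date_format(p.end,   '%Y-%m-%d %H:%i:00')) + 1)
from prod_event p
join prod_minute m
  on m.production_minute >= date_format(p.start, '%Y-%m-%d %H:%i:00')
 and m.production_minute <= p.end;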
Martin
Thanks Martin! This approach would work perfectly for a single machine. An additional complexity is that we report on 219 machines, some number between 0 and 219 of which may be running simultaneously. Therefore the production_minute column could have the same value between 1 and 219 times, depending upon the number of machines running in that minute. We could add another column as the PK so as to make production_minute non-unique, but how could we repeat the same time stamp for production_minute a variable number of times based on the number of machines actually running in that minute?
lighthouse65
If all your data is in the single prod_event table, the single join should cover you for all user_id, work_id, machine_id, and output. Give it a go on a small extract - it should just work.
Martin
I see...will try it and post back...thanks again.
lighthouse65
This approach looks like it will work, but I am struggling with the join. Specifically, there is a many:many relationship between the two tables, based on the join logic: when a prod_event row spans more than one minute there are multiple prod_minute rows that join to it; and when there is more than one machine in operation during any given minute, there are multiple prod_event rows that join to the prod_minute row. The prod_event table has 5 million rows and one month has 43,000 minutes in it. Any ideas (other than dramatically shrinking the data set)?
lighthouse65
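For what it is worth, the join does not have to behave like a full cross product: MySQL can read prod_event once and, for each event row, range-scan prod_minute on its primary key, so the size of the result is the number of events times the minutes each one spans, not 5 million times 43,000. Running EXPLAIN on the query will show whether that plan is chosen (column names assumed from the query in the answer above):

explain
select p.user_id, p.work_id, p.machine_id, m.production_minute, p.output
from prod_event p
join prod_minute m
  on p.start <= m.production_minute
 and m.production_minute <= p.end;
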
I am not sure what your question is. Are you asking for a better table design than the one shown in your question? If so, I think you need to alter the question - or ask a new one.
Martin
Also, some column-oriented databases - InfoBright, Vertica - would store and query your final table very efficiently. Both offer trial or Open Source versions: I think you would be surprised.
Martin
Martin, thank you very much for all of your help. Before moving beyond MySQL, our current DBMS, we would want to push this analysis much further in MySQL and consider alternatives only if we could not get it to work there. Thinking through where this thread is going, the real issue now is the most efficient query design for joining two tables with a many:many relationship between the join predicates. This is a different question from my original one and -- I think you are right -- I will pose it as a new question.
lighthouse65