In our application, we collect data on automotive engine performance: source data organized by engine type, the vehicle running the engine, and the engine design. Currently, a new row is inserted for each engine on-off period; we record performance variables whenever the engine state changes from active to inactive or vice versa. The related engineState table looks like this:
+---------+-----------+---------------+---------------------+---------------------+-----------------+
| vehicle | engine    | engine_state  | state_start_time    | state_end_time      | engine_variable |
+---------+-----------+---------------+---------------------+---------------------+-----------------+
| 080025  | E01       | active        | 2008-01-24 16:19:15 | 2008-01-24 16:24:45 | 720             |
| 080028  | E02       | inactive      | 2008-01-24 16:19:25 | 2008-01-24 16:22:17 | 304             |
+---------+-----------+---------------+---------------------+---------------------+-----------------+
For a specific analysis, we would like to analyze the table content at a row granularity of one minute, rather than the current basis of active / inactive engine state. For this, we are thinking of creating a simple productionMinute table with one row for each minute in the period we are analyzing, and joining the productionMinute and engineState tables on the date-time columns in each table. So if our period of analysis is from 2009-12-01 to 2010-02-28, we would create a new table with 129,600 rows (90 days x 1,440 minutes), one for each minute of each day in that three-month period. The first few rows of the productionMinute table:
+---------------------+
| production_minute   |
+---------------------+
| 2009-12-01 00:00    |
| 2009-12-01 00:01    |
| 2009-12-01 00:02    |
| 2009-12-01 00:03    |
+---------------------+
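For reference, here is a minimal sketch of how such a minute table could be populated. It uses Python with an in-memory SQLite database purely for illustration (SQLite is an assumption here, not necessarily the database in use, and the function name build_production_minute is hypothetical). Each minute is stored as a full 'YYYY-MM-DD HH:MM:SS' string so that it compares cleanly against the state timestamps.

```python
import sqlite3
from datetime import datetime, timedelta


def build_production_minute(conn, start, end):
    """Create productionMinute with one row per minute in [start, end).

    Hypothetical helper for illustration only.
    """
    conn.execute(
        "CREATE TABLE productionMinute (production_minute TEXT PRIMARY KEY)"
    )
    total_minutes = int((end - start).total_seconds() // 60)
    rows = (
        ((start + timedelta(minutes=i)).strftime("%Y-%m-%d %H:%M:%S"),)
        for i in range(total_minutes)
    )
    conn.executemany("INSERT INTO productionMinute VALUES (?)", rows)
    return total_minutes


conn = sqlite3.connect(":memory:")
# 2009-12-01 through 2010-02-28 inclusive, i.e. up to (not including) 2010-03-01.
n = build_production_minute(conn, datetime(2009, 12, 1), datetime(2010, 3, 1))
print(n)  # 129600
```

The half-open interval [start, end) avoids an off-by-one row at the end of the period.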
The join between the tables would be:
FROM engineState AS es
LEFT JOIN productionMinute AS pm
  ON pm.production_minute >= es.state_start_time
 AND pm.production_minute <= es.state_end_time
This join, however, raises several practical issues:
- The engineState table has 5 million rows and the productionMinute table has 130,000 rows.
- When an engineState row spans more than one minute (i.e. the difference between es.state_start_time and es.state_end_time is greater than one minute), as is the case in the example above, multiple productionMinute rows join to a single engineState row.
- When more than one engine is in operation during any given minute, also as per the example above, multiple engineState rows join to a single productionMinute row.
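To make the fan-out in both directions concrete, here is a small sketch that runs the range join on the two sample engineState rows shown earlier. It assumes Python with an in-memory SQLite database (not necessarily our production database) and a seven-row productionMinute table covering 16:19 to 16:25, with minutes stored as full 'YYYY-MM-DD HH:MM:SS' strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE engineState (
    vehicle TEXT, engine TEXT, engine_state TEXT,
    state_start_time TEXT, state_end_time TEXT, engine_variable INTEGER
);
INSERT INTO engineState VALUES
    ('080025', 'E01', 'active',   '2008-01-24 16:19:15', '2008-01-24 16:24:45', 720),
    ('080028', 'E02', 'inactive', '2008-01-24 16:19:25', '2008-01-24 16:22:17', 304);
CREATE TABLE productionMinute (production_minute TEXT PRIMARY KEY);
""")
# Minutes 16:19 through 16:25 (illustrative window around the sample rows).
conn.executemany(
    "INSERT INTO productionMinute VALUES (?)",
    [(f"2008-01-24 16:{m:02d}:00",) for m in range(19, 26)],
)

range_join = """
FROM engineState AS es
LEFT JOIN productionMinute AS pm
  ON pm.production_minute >= es.state_start_time
 AND pm.production_minute <= es.state_end_time
"""

# One engineState row matches many minutes.
per_engine = conn.execute(
    "SELECT es.engine, COUNT(pm.production_minute) " + range_join +
    "GROUP BY es.engine ORDER BY es.engine"
).fetchall()
print(per_engine)  # [('E01', 5), ('E02', 3)]

# One minute matches many engineState rows.
per_minute = conn.execute(
    "SELECT pm.production_minute, COUNT(*) " + range_join +
    "GROUP BY pm.production_minute ORDER BY pm.production_minute"
).fetchall()
print([count for _, count in per_minute])  # [2, 2, 2, 1, 1]
```

Two source rows already produce eight joined rows here, which is the growth we are worried about at 5 million rows.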
In testing our logic on only a small table extract (one day rather than three months for the productionMinute table), the query takes over an hour to run. To improve performance enough that querying three months of data becomes feasible, our thought was to create a temporary table from the engineState table, eliminating any data that is not critical for the analysis, and to join that temporary table to the productionMinute table. We are also planning to experiment with different join types, specifically an inner join, to see whether that improves performance.
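The temporary-table idea could look roughly like the sketch below. It is untested against our real data volumes and makes several assumptions: Python with an in-memory SQLite database stands in for our actual database, the names engineStateSlim and idx_slim_times are hypothetical, the third engineState row is made up to show the period filter, and the index on the two time columns is our own guess at what might help. Whether any of this actually improves performance is exactly what we are unsure of.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE engineState (
    vehicle TEXT, engine TEXT, engine_state TEXT,
    state_start_time TEXT, state_end_time TEXT, engine_variable INTEGER
);
INSERT INTO engineState VALUES
    ('080025', 'E01', 'active',   '2008-01-24 16:19:15', '2008-01-24 16:24:45', 720),
    ('080028', 'E02', 'inactive', '2008-01-24 16:19:25', '2008-01-24 16:22:17', 304),
    -- Made-up row outside the analysis period, to show the filter.
    ('080025', 'E01', 'inactive', '2008-01-25 09:00:00', '2008-01-25 09:05:00', 0);
CREATE TABLE productionMinute (production_minute TEXT PRIMARY KEY);
-- Slimmed-down copy: only the columns the join needs.
CREATE TEMP TABLE engineStateSlim (
    engine TEXT, state_start_time TEXT, state_end_time TEXT
);
""")
conn.executemany(
    "INSERT INTO productionMinute VALUES (?)",
    [(f"2008-01-24 16:{m:02d}:00",) for m in range(19, 26)],
)

# Keep only rows that overlap the analysis period [period_start, period_end).
period_start, period_end = "2008-01-24 00:00:00", "2008-01-25 00:00:00"
conn.execute(
    """
    INSERT INTO engineStateSlim
    SELECT engine, state_start_time, state_end_time
    FROM engineState
    WHERE state_end_time >= ? AND state_start_time < ?
    """,
    (period_start, period_end),
)
conn.execute(
    "CREATE INDEX idx_slim_times "
    "ON engineStateSlim (state_start_time, state_end_time)"
)

# Inner join driven from the minute table, aggregated per minute.
rows = conn.execute("""
    SELECT pm.production_minute, COUNT(*) AS engines_in_state
    FROM productionMinute AS pm
    INNER JOIN engineStateSlim AS es
       ON es.state_start_time <= pm.production_minute
      AND es.state_end_time   >= pm.production_minute
    GROUP BY pm.production_minute
    ORDER BY pm.production_minute
""").fetchall()
slim_count = conn.execute("SELECT COUNT(*) FROM engineStateSlim").fetchone()[0]
print(slim_count, len(rows))  # 2 5
```

Note that the inner join drops minutes with no matching engine state, whereas a left join from productionMinute would keep them with NULLs; which behavior we want depends on the analysis.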
What is the best query design for joining tables with the many:many relationship between the join predicates as outlined above? What is the best join type (left / right, inner)?