In our application, we collect data on automotive engine performance -- essentially source data on engine performance keyed by engine type, the vehicle running it and the engine design. Currently, the basis for new row inserts is an engine on-off period; we capture performance variables whenever the engine state changes from active to inactive and vice versa. The related engineState table looks like this:

+---------+-----------+---------------+---------------------+---------------------+-----------------+
| vehicle | engine    | engine_state  | state_start_time    | state_end_time      | engine_variable |
+---------+-----------+---------------+---------------------+---------------------+-----------------+
| 080025  | E01       | active        | 2008-01-24 16:19:15 | 2008-01-24 16:24:45 |             720 | 
| 080028  | E02       | inactive      | 2008-01-24 16:19:25 | 2008-01-24 16:22:17 |             304 |
+---------+-----------+---------------+---------------------+---------------------+-----------------+ 

For a specific analysis, we would like to analyze table content at a row granularity of minutes, rather than the current basis of active / inactive engine state. For this, we are thinking of creating a simple productionMinute table with a row for each minute in the period we are analyzing and joining the productionMinute and engineState tables on the date-time columns in each table. So if our period of analysis is from 2009-12-01 to 2010-02-28, we would create a new table with 129,600 rows, one for each minute of each day for that three-month period. The first few rows of the productionMinute table:

+---------------------+ 
| production_minute   |
+---------------------+
| 2009-12-01 00:00    |
| 2009-12-01 00:01    |
| 2009-12-01 00:02    |     
| 2009-12-01 00:03    |
+---------------------+
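
Roughly, we plan to populate productionMinute with something like the following (a sketch only; the digits helper table and the hard-coded bounds are just for illustration):

    CREATE TABLE digits (d INT PRIMARY KEY);
    INSERT INTO digits VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);

    CREATE TABLE productionMinute (
        production_minute DATETIME NOT NULL PRIMARY KEY
    );

    -- 129,600 minutes: 2009-12-01 00:00 through 2010-02-28 23:59
    INSERT INTO productionMinute (production_minute)
    SELECT TIMESTAMP('2009-12-01 00:00:00') + INTERVAL n MINUTE
      FROM (SELECT d1.d + d2.d*10 + d3.d*100 + d4.d*1000 + d5.d*10000 + d6.d*100000 AS n
              FROM digits d1, digits d2, digits d3, digits d4, digits d5, digits d6) AS numbers
     WHERE n < 129600;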

The join between the tables would be:

     FROM engineState AS es 
LEFT JOIN productionMinute AS pm ON pm.production_minute >= es.state_start_time 
                                AND pm.production_minute <= es.state_end_time 

This join, however, raises several issues in our environment:

  1. The engineState table has 5 million rows and the productionMinute table has 130,000 rows
  2. When an engineState row spans more than one minute (i.e. the difference between es.state_start_time and es.state_end_time is greater than one minute), as is the case in the example above, there are multiple productionMinute table rows that join to a single engineState table row
  3. When there is more than one engine in operation during any given minute, also as per the example above, multiple engineState table rows join to a single productionMinute row

In testing our logic using only a small extract (one day rather than 3 months for the productionMinute table), the query takes over an hour to run. In researching how to improve performance enough to make querying three months of data feasible, our thought was to create a temporary table from the engineState table, eliminating any data that is not critical for the analysis, and to join the temporary table to the productionMinute table. We are also planning to experiment with different joins -- specifically an inner join -- to see if that would improve performance.
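
Roughly what we have in mind (untested; the temporary table name and the trimmed column list are just illustrative):

    CREATE TEMPORARY TABLE engineStateSlim AS
    SELECT vehicle, engine, engine_state, state_start_time, state_end_time, engine_variable
      FROM engineState
     WHERE state_end_time   >= '2009-12-01 00:00:00'
       AND state_start_time <  '2010-03-01 00:00:00';

    ALTER TABLE engineStateSlim
      ADD INDEX idx_state_times (state_start_time, state_end_time);

    SELECT pm.production_minute, es.vehicle, es.engine, es.engine_variable
      FROM productionMinute AS pm
     INNER JOIN engineStateSlim AS es
        ON pm.production_minute >= es.state_start_time
       AND pm.production_minute <= es.state_end_time;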

What is the best query design for joining tables with the many:many relationship between the join predicates as outlined above? What is the best join type (left / right, inner)?

A: 

The performance will depend on how the data in your tables is structured.

A left or right outer join is only useful if you want all rows from the left or right table for the selected projection, even when they have no match in the table being joined.

Trust your query optimizer to find the most efficient join algorithm for your data; it was built to do that job well. If you have performance issues, look at how the data is structured and stored.

Jeremy Petzold
Thanks Jeremy; but that's exactly the question I'm asking -- how should we (re-)structure and store the data to optimize query performance when working through a many:many relationship between join predicates on large datasets? Keep in mind that we are not tied to our current design, because we can use temporary tables to restructure the data and put indices on the join predicates... but is this an approach that's worked for others facing a similar performance challenge? If not, what are approaches that have worked?
lighthouse65
But that is not the question you asked. You asked specifically about joins. If you have a very large data set and you have multiple fields that you want to index, it is best to use B+ trees to index your fields. It will require fewer IOs in almost all cases when doing a query. I am not sure how much control MySQL gives you over the indexing techniques you can employ, but if you have a choice, choose that. If you don't have a choice, then I suspect it uses B+ trees for indexing already, and specifying a field to be indexed should cover you.
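In MySQL, for instance, the index type can be stated explicitly when creating an index (the index name here is arbitrary, and BTREE is already the default for MyISAM and InnoDB, so this may be a no-op):

    CREATE INDEX idx_state_times
        ON engineState (state_start_time, state_end_time)
        USING BTREE;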
Jeremy Petzold
Thanks for the response, Jeremy. I believe MySQL does allow us to specify the index type to employ. We will look at this option further and I will post back on what we find.
lighthouse65
A: 

Data retrieval performance is a function of

  • speed of access to the data on the disk (depends on existence of indexes, size of tables, size of cache, raw I/O speed)
  • number of records that need to be returned (some joins reduce the number of rows returned, some increase it; some conditions can be applied on indexes, some must go to the records)
  • number of columns that you need to return

All of these you can optimize by

  • adding indexes
  • reducing the size of the table by partitioning it horizontally (splitting the table into two or more semantically different row sets - for example, if from your 5m-row table you actually work with only 100k records 99.5% of the time, maybe you can split the table into active/inactive or similar; see the sketch after this list)
  • if you cannot split horizontally, you might split the table vertically - the number of columns a table has also influences retrieval speed (but not as much)
  • finally, improving the raw I/O speed can be achieved by splitting the table transparently across multiple hard disks (but know your underlying file system properties)
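
If splitting rows out does make sense, MySQL's built-in range partitioning (5.1+) is one way to get a similar effect transparently; a rough sketch only, and it assumes state_start_time can be made part of any primary/unique key on the table:

    ALTER TABLE engineState
    PARTITION BY RANGE (TO_DAYS(state_start_time)) (
        PARTITION p2009   VALUES LESS THAN (TO_DAYS('2010-01-01')),
        PARTITION p2010q1 VALUES LESS THAN (TO_DAYS('2010-04-01')),
        PARTITION pmax    VALUES LESS THAN MAXVALUE
    );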

Indexes have the highest impact on performance because they can cut down disk access time and speed up in-memory operations by orders of magnitude (they turn O(n) into O(log n) at the cost of maintaining the index structure, so they do slow down updates).

For maximum retrieval speed, indexes should cover all the join and WHERE conditions, and queries should be written so that the query optimizer can determine which of these will yield the highest benefit if executed first (highest selectivity).

For your particular example, try to benchmark different combinations of indexes:

  1. pm.production_minute should be indexed for sure
  2. with es.state_start_time and es.state_end_time you have 4 possible index options (which you can combine):
    index on es.state_start_time
    index on es.state_end_time
    index on (es.state_start_time, es.state_end_time)
    index on (es.state_end_time, es.state_start_time)

Knowing your data would let you determine which one is optimal. I wouldn't be surprised if you found that the last two (the two-column indexes) perform best, or one single-column index plus one two-column index (with its columns in reverse order).

In both of these cases a decent optimizer would be able to determine the result set from just reading the indexes, without even looking at the actual records, and that cuts down your disk access substantially.
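
Spelled out as DDL, the candidates would look something like this (index names are arbitrary, the first one is unnecessary if production_minute is already the primary key, and you would keep only the winning combination):

    CREATE INDEX idx_pm_minute     ON productionMinute (production_minute);
    CREATE INDEX idx_es_start      ON engineState (state_start_time);
    CREATE INDEX idx_es_end        ON engineState (state_end_time);
    CREATE INDEX idx_es_start_end  ON engineState (state_start_time, state_end_time);
    CREATE INDEX idx_es_end_start  ON engineState (state_end_time, state_start_time);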

Unreason
A: 

My experience is that the MySQL query optimizer is pretty bad. The one in PostgreSQL is much better.

Your problem is that your data is structured for ease of recording, not for ease of analysis. My suggestion is that you go ahead and create the temporary table, but not the way you might imagine. I think that your best bet is to have a post-processing step at the end of each day which takes all of the day's data and creates minute-by-minute entries in a new table (ideally on a different spindle) with a production_minute index. This new database will be faster for your analytical queries, and the queries won't noticeably slow down the data collection.
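
A sketch of what that nightly step could look like (the target table name and column types are invented for illustration, and it ignores states that span midnight):

    CREATE TABLE IF NOT EXISTS engineStateByMinute (
        production_minute DATETIME NOT NULL,
        vehicle           CHAR(6)  NOT NULL,
        engine            CHAR(3)  NOT NULL,
        engine_state      VARCHAR(10),
        engine_variable   INT,
        KEY idx_minute (production_minute)
    );

    -- run once per day, expanding yesterday's states into one row per minute
    INSERT INTO engineStateByMinute
    SELECT pm.production_minute, es.vehicle, es.engine, es.engine_state, es.engine_variable
      FROM engineState AS es
      JOIN productionMinute AS pm
        ON pm.production_minute BETWEEN es.state_start_time AND es.state_end_time
     WHERE es.state_start_time >= CURDATE() - INTERVAL 1 DAY
       AND es.state_start_time <  CURDATE();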

vy32
+1  A: 

I agree with vy32. You need to do this query once and only once to get your data into a format suitable for analysis. You should use a proper ETL tool (or heck, just perl or something simple) to get the data out of the engineState table, calculate the production minute, then load it into another DB that is properly modeled for analysis-type queries.

If you think your problem through you're just denormalizing your data and assigning minute numbers as surrogate keys. This is a relatively easy (and common) ETL problem which isn't performant in straight SQL but is simple with other languages and tools.

Your production volume would be easily handled by a true ETL process.
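
For example, the minute surrogate keys could be computed during the extract (the epoch '2009-12-01 00:00:00' is just an illustrative anchor for the analysis period):

    SELECT TIMESTAMPDIFF(MINUTE, '2009-12-01 00:00:00', es.state_start_time) AS start_minute_id,
           TIMESTAMPDIFF(MINUTE, '2009-12-01 00:00:00', es.state_end_time)   AS end_minute_id,
           es.vehicle, es.engine, es.engine_state, es.engine_variable
      FROM engineState AS es;

The load step then just fans each row out into one target row per minute between start_minute_id and end_minute_id.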

bot403
A: 

If I understood correctly, you are looking at a BI problem. A BI layout would keep the operational data apart from the consolidated data.

For this to happen (quick and dirty) you'll need three elements.

  • Your operative data
  • An ETL job, which only needs to perform the query you've shown and insert the result set into another denormalized table
  • Denormalized tables where you'll save your consolidated data.

This way you'll speed up your query as it would now be a simple select.
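
A possible shape for the consolidated table and the resulting query (all names here are illustrative):

    CREATE TABLE engineMinuteSummary (
        production_minute DATETIME NOT NULL,
        vehicle           CHAR(6)  NOT NULL,
        engine            CHAR(3)  NOT NULL,
        engine_state      VARCHAR(10),
        engine_variable   INT,
        PRIMARY KEY (production_minute, vehicle, engine)
    );

    -- the analysis query is now a plain indexed range scan
    SELECT *
      FROM engineMinuteSummary
     WHERE production_minute BETWEEN '2009-12-01 00:00:00' AND '2010-02-28 23:59:00';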

As in any BI solution, you'll need to run the ETL on a daily basis (depending on your business needs) to update your denormalized information.

On the other hand, you could turn down the BI route and work on your current schema/query. You could add indexes and statistics and alter tables, but in my opinion this is not a scalable solution. You could solve the performance problem for a database holding three months of data, but what about a database holding three years?

mcabral
A: 

The choice of LEFT JOIN, INNER JOIN or RIGHT JOIN is a semantic one - using a different join type just for performance is not only a bad idea, it means that the relationship between the tables has not been fully understood, since the different JOIN types mean different things and can return different information.

Typically, INNER JOINs are very optimizer friendly, since this allows different filter criteria from your WHERE clause and the JOIN condition to be pushed around a lot more to improve the index scans or table scans. Referential integrity constraints can also give the optimizer information about data being guaranteed to exist on both sides.

You should review your execution plans and look at your indexing strategies. Ideally, you want narrow, covering indexes and you want to see index seeks, index scans, table scans (in order of preference) in your plans.
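
In MySQL the plan for the query in question can be inspected with EXPLAIN, for example:

    EXPLAIN
    SELECT pm.production_minute, es.vehicle, es.engine, es.engine_variable
      FROM productionMinute AS pm
      JOIN engineState AS es
        ON pm.production_minute BETWEEN es.state_start_time AND es.state_end_time;

In the output, an access type of ref or range on an indexed column is what you want to see; ALL means a full table scan.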

Typically, you want your model normalized for transaction processing and denormalized for reporting, but maintaining two models is annoying at the beginning, so you start by trying to do reporting and analysis on the normalized data; this can work for a while with better indexes and attention to the execution plans.

When your reporting gets too slow on a well-indexed normal form, I would look into transforming the data to, perhaps, a dimensional model (have a look at Kimball's methodology) with star schemas, which have very simple layouts for reporting (typically all INNER JOINs around a simple star) and can be optimized very well on traditional database systems.
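
As a very rough sketch of what a dimensional model for this data might look like (all names invented for illustration):

    CREATE TABLE dim_minute (
        minute_key        INT      NOT NULL PRIMARY KEY,  -- surrogate key
        production_minute DATETIME NOT NULL,
        day_of_week       TINYINT,
        hour_of_day       TINYINT
    );

    CREATE TABLE dim_engine (
        engine_key INT NOT NULL PRIMARY KEY,              -- surrogate key
        vehicle    CHAR(6),
        engine     CHAR(3)
    );

    CREATE TABLE fact_engine_minute (
        minute_key      INT NOT NULL,
        engine_key      INT NOT NULL,
        engine_state    VARCHAR(10),
        engine_variable INT,
        PRIMARY KEY (minute_key, engine_key)
    );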

Cade Roux