views: 385
answers: 9

Hello guys, I wrote a statement that takes almost an hour to run, so I am asking for help to make it faster. Here we go:

I am making an inner join of two tables :

I have many time intervals stored in intervals, and I want to get measure data from measures only within those intervals.

intervals: has two columns, the starting time and the ending time of each interval (number of rows = 1295)

measures: has two columns, one with the measure, the other with the time the measure was taken (number of rows = one million)

The result I want is a table with the measure in the first column, then the time the measure was taken, then the begin/end times of the matching interval (the interval columns are repeated for every row whose time falls within that interval).

Here is my code:

select measures.measure     as measure,
       measures.time        as time,
       intervals.entry_time as entry_time,
       intervals.exit_time  as exit_time
    from intervals
    inner join measures
        on  intervals.entry_time <= measures.time
        and measures.time <= intervals.exit_time
    order by time asc

Thanks

+2  A: 

The first thing I'd do is have your database tool generate an execution plan that you can view (this is "Control-L" in MSSQL, but I'm not sure how to do it in Oracle) - that will point out the slow parts and, depending on your server/editor, it may even recommend some basic indexes. Once you have an execution plan, you can look for table scans or nested loop joins, both of which are really slow - indexes can help with table scans, and you can add additional join predicates to help alleviate loop joins.
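For Oracle (which later comments suggest is the database here), a sketch of the equivalent, assuming your user has access to the standard PLAN_TABLE:

EXPLAIN PLAN FOR
SELECT m.measure, m.time, i.entry_time, i.exit_time
FROM   intervals i
JOIN   measures m
  ON   m.time BETWEEN i.entry_time AND i.exit_time
ORDER  BY m.time;

-- then display the plan
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);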

My guess would be that MEASURES needs an index on the TIME column, and you can include the MEASURE column as well to speed up lookups. Try this:

CREATE INDEX idxMeasures_Time ON Measures ([Time]) INCLUDE (Measure)

Also, though this won't change your execution plan or speed up your query, it may make your join clause a bit easier to read:

ON measures.time BETWEEN intervals.entry_time AND intervals.exit_time

This just combines your two comparison predicates into a single one.

rwmnau
BETWEEN is just syntactic sugar - it won't be any faster, nor any slower either, really.
marc_s
That's true but rwmnau does say that: "this won't change your execution plan or speed up your query". I agree that it's easier to read with a BETWEEN clause so +1 for that.
David Aldridge
+2  A: 

You can't really optimize your statement - it's pretty simple as it is.

What you could do is investigate if some indices would help you.

You're selecting on intervals.entry_time, intervals.exit_time, measures.time - are those columns indexed?
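As a sketch (the index names here are made up), the indexes in question would be along the lines of:

CREATE INDEX measures_time_idx   ON measures (time);
CREATE INDEX intervals_times_idx ON intervals (entry_time, exit_time);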

marc_s
No, I thought they were, but they are not - tomorrow I will change this.
+1  A: 

Your SQL is equivalent to:

select m.measure, m.time, 
     i.entry_time, i.exit_time
from intervals i
    join measures m
        on m.time between i.entry_time and i.exit_time  
order by time asc

The only thing I might suggest is making sure there's an index on m.Time. Then, if that doesn't improve performance enough, try adding indexes on i.Entry_Time and i.Exit_Time as well.

Charles Bretana
Yes, this is what most answers have suggested; I will try it tomorrow. Thanks, Charles.
+15  A: 
Quassnoi
+1 Too bad I can't star answers as favorites. I guess I'll just have to bookmark the blog page. :)
Jeremy Seghi
Hi Quassnoi, I am new to this. I read your blog, but it's not so simple for me; I think I would understand better with a practical example. Anyway, I will try what you suggested with the index on measures.time and see if I get it to run. As I can't modify the tables and create an index myself, I will only be able to do it tomorrow; I will give you feedback then. Thank you!
It is really not clear to me how the B-tree indexing would make it faster...
`@iracema78280`: since you have a non-equality condition in your join, the only way to perform this join is `NESTED LOOPS`. The engine has to take each record from the outermost table (`intervals`) and find all matching records in the innermost table (`measures`). If you have a `B-Tree` index on `time`, this index will be used to constrain the number of records examined in each nested loop. Instead of scanning `1,000,000` records in each loop and filtering out the non-matching ones, the engine will just find all times within the interval bounds using the index.
Quassnoi
A sort-merge ought to be efficient though. The sort on `intervals` would be trivial and a composite index on (time,measure) would reduce the cost of the sort on the measures data to that of a full index scan. I'd try that as well as an index method.
David Aldridge
`@David Aldridge`: `MERGE JOIN` cannot be used here: it's not an equijoin. The index on `measures` would need to be rewound each time a new `interval` is evaluated.
Quassnoi
A sort-merge isn't restricted to equi-joins. http://download.oracle.com/docs/cd/B19306_01/server.102/b14211/optimops.htm#i49183
David Aldridge
I'm afraid that the more I think about a nested loop driven from the intervals table, the less I like it. Assuming every measure matches one interval (the best-case scenario, I think), this means that for each of the 1295 rows in `intervals` we look up 772 rows in `measures` -- something like 1,500 consistent gets per interval. That's nearly two million consistent gets, and a test on 10.2.0.4 shows an estimated cost of 242k for the NL vs 12k for the sort-merge.
David Aldridge
Hmm, actually that estimation of the consistent gets is probably too high, but I think it's still way higher than the sort-merge.
David Aldridge
`@David Aldridge`: sorry, I actually meant "not a single-column join" here. A `MERGE` join can in fact be forced here, but it will require an additional `FILTER` condition which will kill its performance. It will be optimized to a `MERGE JOIN` on `measure.time >= intervals.entry_time` with additional filtering on `measure.time <= intervals.exit_time` (or vice versa). In both cases, too many rows will need to be returned before filtering.
Quassnoi
`@David Aldridge`: please see the post update for performance comparisons.
Quassnoi
The sort-merge join is indeed implemented as an access based on `entry_time` with a filter on `exit_time` (actually there's a filter on entry_time also in my tests), but removing the exit_time clause does not alter the cost estimation for the query. That doesn't surprise me though because it's pure CPU time that is insignificant compared to the cost of sorts or of consistent gets. The optimiser actually prefers the sort-merge in the tests I've run.
David Aldridge
`@David Aldridge`: on average, the `MERGE JOIN` query will need to make `(1,265 * 1,000,000) / 2 ~= 600M` filter evaluations. I wouldn't call this "insignificant", and I wouldn't trust the cost estimation in this case at all. The optimizer does not take into account the fact that the intervals do not intersect.
Quassnoi
OK, I'll look into that some more. I see that your data setup is a bit different to mine -- not all of the measures will join to an interval (50% of them do not), and the time is a PK on measures. I'm not sure whether those are significant or not though. What plan does your optimizer choose without any hints (not that I take that as the gold-standard you understand)?
David Aldridge
`@David Aldridge`: if all measures did join, this would just make the query run `2` times as long (which would give us `3.5` seconds). The `PK` index on `time` is not used in any way: the index on `(time, measure)` is used instead. You can safely drop it. Without the hints, the optimizer chooses `NESTED LOOPS` with the table order swapped: `measures` leading and `intervals` driven.
Quassnoi
What's your version? CPU costing enabled?
David Aldridge
`Oracle Database 10g Express Edition Release 10.2.0.1.0 - Product`. Yes, the `CPU` costing is enabled.
Quassnoi
+1 for Nested Loops
APC
A: 

Not knowing your database system and version, I'd say that (lack of) indexing and the join clause could be causing the problem.

For every record in the measures table, you can have multiple records in the intervals table (intervals.entry_time <= measures.time), and for every record in the intervals table, you can have multiple records in measures (measures.time <= intervals.exit_time). The resulting one-to-many and many-to-one relationships caused by the join mean multiple table scans for each record. I doubt that Cartesian product is the correct term, but it's pretty close.

Indexing would definitely help, but it would help even more if you could find a better key to join the two tables. Having the one-to-many relationship going in one direction only would definitely speed up the processing, as it wouldn't have to scan each table/index twice for each record.

Jeremy Seghi
"Indexing would definitely help" - twaddle. Under that SQL, EVERY row in measures and intervals needs to be processed. With only 1000(ish) rows in intervals, it's probably only a handful of blocks and wouldn't benefit from indexing, as there's a good chance the entire index would need to be scanned because the criteria are on two columns. Trying to read the entire measures table through multiple index scans would probably slow it down.
Gary
I agree with you, Jeremy, that I should try to restrict it to a one-to-many relationship, but I tried and, with my lack of experience, I didn't find a way. I am definitely trying the index method, thanks.
`@Gary`: every row in `measures` and `intervals` should of course be processed. But using the indexes each row will only need to be processed *once*. Without the indexes, each row would need to be processed many times, as in `CARTESIAN JOIN`. So yes, indexing would definitely help.
Quassnoi
@Gary Every response has mentioned indexing in one way or another; the point I was trying to make was that the Cartesian product caused by the join was a contributing factor.
Jeremy Seghi
+2  A: 

try a parallel query

  alter session enable parallel query;

  select /*+ parallel */ ... same as before;

You could also create a materialized view, perhaps with the parallel hint above. It may take a long time to create the MV but once created it can be queried repeatedly.
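A sketch of the materialized-view idea (the view name and refresh options here are illustrative, not prescriptive):

CREATE MATERIALIZED VIEW measures_in_intervals_mv
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
AS
SELECT /*+ parallel */
       m.measure, m.time, i.entry_time, i.exit_time
FROM   intervals i
JOIN   measures m
  ON   m.time BETWEEN i.entry_time AND i.exit_time;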

Hi redcayuga, can you explain to me in more detail how parallel queries work and how they would apply in this case, please?
Enabling parallel query tells Oracle to break up a big query into pieces and run those pieces in parallel (multi-threaded or multi-tasked or whatever). Even if you have only one CPU you may benefit from PQ; while some threads are blocked trying to read the disk, another thread could use the CPU to process what's in memory. To use it, issue the "alter session" command once and put the /*+ parallel */ hint right after the select. No space between the /* and the + sign. Oracle should automatically use PQ. Are these tables analyzed?
+1  A: 

You're pretty much going to get most of the rows from both tables in this case, plus you've got a sort.

The question is, does the calling process really need all the rows, or just the first few? This would change how I'd go about optimising the query.

I'll assume your calling process wants ALL the rows. Since the join predicate is not on an equality, I'd say a MERGE JOIN may be the best approach to aim for. A merge join requires its data sources to be sorted, so if we can avoid a sort the query should run as fast as it possibly can (barring more interesting approaches such as specialised indexes or materialized views).

To avoid the SORT operations on intervals and measures, you could add indexes on (measures.time,measures.measure) and (intervals.entry_time, intervals.exit_time). The database can use the index to avoid a sort, and it'll be faster because it doesn't have to visit any table blocks.
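For example (the index names are made up), the two covering indexes might look like:

-- Both indexes cover the columns the query needs, so the merge join can
-- read pre-sorted data straight from the indexes and skip both SORT steps
CREATE INDEX measures_time_measure_idx ON measures (time, measure);
CREATE INDEX intervals_entry_exit_idx  ON intervals (entry_time, exit_time);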

Alternatively, if you only have an index on measures.time, the query may still run ok without adding another big index - it'll run slower though because it'll probably have to read many table blocks to get the measures.measure for the SELECT clause.

Jeffrey Kemp
+1: Yeah I like a sort-merge for this. Difficult to tell without execution plans though.
David Aldridge
+3  A: 

To summarise: your query is running against the full set of MEASURES. It matches the time of each MEASURES record to an INTERVALS record. If the window of times spanned by INTERVALS is roughly similar to the window spanned by MEASURES then your query is also running against the full set of INTERVALS, otherwise it is running against a subset.

That matters because it reduces your scope for tuning, as a full table scan is likely to be the fastest way of getting all the rows. So, unless your real MEASURES or INTERVALS tables have a lot more columns than you've shown us, it is unlikely that any indexes will give much advantage.

The possible strategies are:

  • no indexes at all
  • index on MEASURES (TIME,MEASURE)
  • index on MEASURES (TIME)
  • no index on MEASURES
  • index on INTERVALS (ENTRY_TIME, EXIT_TIME)
  • index on INTERVALS (ENTRY_TIME)
  • no index on INTERVALS
  • parallel query

I'm not going to present test cases for all the permutations, because the results are pretty much as we would expect.

Here is the test data. As you can see, I'm using slightly larger data sets. The INTERVALS window is bigger than the MEASURES window, but not by much. The intervals are 10000 seconds wide, and the measures are taken every 15 seconds.

SQL> select min(entry_time), max(exit_time), count(*) from intervals;

MIN(ENTRY MAX(EXIT_   COUNT(*)
--------- --------- ----------
01-JAN-09 20-AUG-09       2001

SQL> select min(ts), max(ts), count(*) from measures;

MIN(TS)   MAX(TS)     COUNT(*)
--------- --------- ----------
02-JAN-09 17-JUN-09    1200001

SQL>

NB In my test data I have presumed that INTERVALS records do not overlap. This has an important corollary: a MEASURES record joins to only one INTERVALS record.

Benchmark

Here is the benchmark with no indexes.

SQL> exec dbms_stats.gather_table_stats(user, 'MEASURES', cascade=>true)

PL/SQL procedure successfully completed.

SQL> exec dbms_stats.gather_table_stats(user, 'INTERVALS', cascade=>true)

PL/SQL procedure successfully completed.

SQL> set timing on
SQL> 
SQL> select m.measure
  2         , m.ts as "TIME"
  3         , i.entry_time
  4         , i.exit_time
  5  from
  6      intervals i
  7  inner join
  8      measures m
  9      on ( m.ts between  i.entry_time and i.exit_time )
 10  order by m.ts asc
 11  /

1200001 rows selected.

Elapsed: 00:05:37.03

SQL>

MEASURES tests

Now let's build a unique index on INTERVALS (ENTRY_TIME, EXIT_TIME) and try out the various indexing strategies for MEASURES. First up, an index on the MEASURES TIME column only.

SQL> create index meas_idx on measures (ts)
  2  /

Index created.

SQL> exec dbms_stats.gather_table_stats(user, 'MEASURES', cascade=>true)

PL/SQL procedure successfully completed.

SQL> 
SQL> set autotrace traceonly exp
SQL> 
SQL> set timing on
SQL> 
SQL> select m.measure
  2         , m.ts as "TIME"
  3         , i.entry_time
  4         , i.exit_time
  5  from
  6      intervals i
  7  inner join
  8      measures m
  9      on ( m.ts between  i.entry_time and i.exit_time )
 10  order by m.ts asc
 11  /

1200001 rows selected.

Elapsed: 00:05:20.21

SQL>

Now, let us index the MEASURES TIME and MEASURE columns

SQL> drop  index meas_idx
  2  /

Index dropped.

SQL> create index meas_idx on measures (ts, measure)
  2  /

Index created.

SQL> exec dbms_stats.gather_table_stats(user, 'MEASURES', cascade=>true)

PL/SQL procedure successfully completed.


SQL> select m.measure
  2         , m.ts as "TIME"
  3         , i.entry_time
  4         , i.exit_time
  5  from
  6      intervals i
  7  inner join
  8      measures m
  9      on ( m.ts between  i.entry_time and i.exit_time )
 10  order by m.ts asc
 11  /

1200001 rows selected.

Elapsed: 00:05:28.54

SQL>

Now with no index on MEASURES (but still an index on INTERVALS)

SQL> drop  index meas_idx
  2  /

Index dropped.

SQL> exec dbms_stats.gather_table_stats(user, 'MEASURES', cascade=>true)

PL/SQL procedure successfully completed.

SQL> select m.measure
  2         , m.ts as "TIME"
  3         , i.entry_time
  4         , i.exit_time
  5  from
  6      intervals i
  7  inner join
  8      measures m
  9      on ( m.ts between  i.entry_time and i.exit_time )
 10  order by m.ts asc
 11  /

1200001 rows selected.

Elapsed: 00:05:24.81

SQL>

So what difference does parallel query make ?

SQL> select /*+ parallel (4) */
  2         m.measure
  3         , m.ts as "TIME"
  4         , i.entry_time
  5         , i.exit_time
  6  from
  7      intervals i
  8  inner join
  9      measures m
 10      on ( m.ts between  i.entry_time and i.exit_time )
 11  order by m.ts asc
 12  /

1200001 rows selected.

Elapsed: 00:02:33.82


SQL>

MEASURES Conclusion

Not much difference in the elapsed time for the different indexes. I was slightly surprised that building an index on MEASURES (TS, MEASURE) resulted in a full table scan and a somewhat slower execution time. On the other hand, it is no surprise that running with parallel query is much faster. So if you have Enterprise Edition and the CPUs to spare, using PQ will definitely reduce the elapsed time, although it won't change the resource costs much (and it actually does a lot more sorting).

INTERVALS tests

So what difference might the various indexes on INTERVALS make? In the following tests we will retain an index on MEASURES (TS). First of all we will drop the primary key on both INTERVALS columns and replace it with a constraint on INTERVALS (ENTRY_TIME) only.

SQL> alter table intervals drop  constraint ivl_pk drop  index
  2  /

Table altered.

SQL> alter table intervals add  constraint ivl_pk primary key (entry_time) using  index
  2  /

Table altered.

SQL> exec dbms_stats.gather_table_stats(user, 'INTERVALS', cascade=>true)

PL/SQL procedure successfully completed.


SQL> select m.measure
  2         , m.ts as "TIME"
  3         , i.entry_time
  4         , i.exit_time
  5  from
  6      intervals i
  7  inner join
  8      measures m
  9      on ( m.ts between  i.entry_time and i.exit_time )
 10  order by m.ts asc
 11  /

1200001 rows selected.

Elapsed: 00:05:38.39

SQL>

Lastly with no index on INTERVALS at all

SQL> alter table intervals drop  constraint ivl_pk drop  index
  2  /

Table altered.

SQL> exec dbms_stats.gather_table_stats(user, 'INTERVALS', cascade=>true)

PL/SQL procedure successfully completed.

SQL> select m.measure
  2         , m.ts as "TIME"
  3         , i.entry_time
  4         , i.exit_time
  5  from
  6      intervals i
  7  inner join
  8      measures m
  9      on ( m.ts between  i.entry_time and i.exit_time )
 10  order by m.ts asc
 11  /

1200001 rows selected.

Elapsed: 00:05:29.15

SQL>

INTERVALS conclusion

The index on INTERVALS makes a slight difference. That is, indexing (ENTRY_TIME, EXIT_TIME) results in faster execution, because it permits a fast full index scan rather than a full table scan. This would be more significant if the time window delineated by INTERVALS were considerably wider than that of MEASURES.

Overall Conclusions

Because we are doing full-table queries, none of the indexes substantially changed the execution time. So if you have Enterprise Edition and multiple CPUs, Parallel Query will give you the best results. Otherwise the best indexes would be INTERVALS(ENTRY_TIME, EXIT_TIME) and MEASURES(TS). The Nested Loops solution is definitely faster than Parallel Query - see Edit 4 below.

If you were running against a subset of MEASURES (say a week's worth), then the presence of indexes would have a bigger impact. It is likely that the two I recommended in the previous paragraph would remain the most effective.

Last observation: I ran this on a bog-standard dual-core laptop with an SGA of just 512M, yet all of my queries took less than six minutes. If your query really takes an hour then your database has some serious problems. Although this long run time could be an artefact of overlapping INTERVALS records, which could result in a Cartesian product.

Edit

Originally I included the output from

SQL> set autotrace traceonly stat exp

But alas SO severely truncated my post. So I have rewritten it, but without the execution plans or stats. Those who wish to validate my findings will have to run the queries themselves.

Edit 4 (previous edits removed for reasons of space)

At the third attempt I have been able to reproduce the performance improvement for Quassnoi's solution.

SQL> set autotrace traceonly stat exp
SQL>
SQL> set timing on
SQL>
SQL> select
  2          /*+ LEADING (i) USE_NL(i, m) */
  3              m.measure
  4             , m.ts as "TIME"
  5             , i.entry_time
  6             , i.exit_time
  7  from
  8      intervals i
  9  inner join
 10      measures m
 11      on ( m.ts between  i.entry_time and i.exit_time )
 12  order by m.ts asc
 13  /

1200001 rows selected.

Elapsed: 00:00:18.39

Execution Plan
----------------------------------------------------------
Plan hash value: 974071908

---------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name      | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |           |  6003K|   257M|       |   973K  (1)| 03:14:46 |
|   1 |  SORT ORDER BY                |           |  6003K|   257M|   646M|   973K  (1)| 03:14:46 |
|   2 |   NESTED LOOPS                |           |       |       |       |            |          |
|   3 |    NESTED LOOPS               |           |  6003K|   257M|       |   905K  (1)| 03:01:06 |
|   4 |     TABLE ACCESS FULL         | INTERVALS |  2001 | 32016 |       |  2739   (1)| 00:00:33 |
|*  5 |     INDEX RANGE SCAN          | MEAS_IDX  | 60000 |       |       |   161   (1)| 00:00:02 |
|   6 |    TABLE ACCESS BY INDEX ROWID| MEASURES  |  3000 | 87000 |       |   451   (1)| 00:00:06 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   5 - access("M"."TS">="I"."ENTRY_TIME" AND "M"."TS"<="I"."EXIT_TIME")


Statistics
----------------------------------------------------------
         66  recursive calls
          2  db block gets
      21743  consistent gets
      18175  physical reads
          0  redo size
   52171689  bytes sent via SQL*Net to client
     880416  bytes received via SQL*Net from client
      80002  SQL*Net roundtrips to/from client
          0  sorts (memory)
          1  sorts (disk)
    1200001  rows processed

SQL>

So Nested Loops are definitely the way to go.

Useful lessons from the exercise

  1. Running diagnostic tests is far more valuable than guessing and theorising
  2. Understanding the data is crucial
  3. Even with 11g, we still sometimes need to use hints to prod the optimizer
APC
Yeah, my thought on this was that a sort-merge would be the answer, and the trick would be to avoid the sort on the measures table spilling to disk. However an index on measures(time, measure) could be full scanned to give the correct order both for the sort-merge and the required order by, so I'm surprised that it didn't do better -- how was the execution plan for that? (I'd run it myself but I'm too lazy to build the initial data sets ... did I say "lazy"? I meant "busy").
David Aldridge
@David. The execution plan showed a FULL TABLE SCAN on MEASURES when I used that index, which I really don't understand. If I wasn't so, er, *busy* myself I would run a 10053 trace to see why. If I get the opportunity I might yet do so, but breakfast beckons, and then the day job after that.
APC
Yeah -- with a skinny measures table I'd not be too surprised if the optimiser did punt on the index scan and go for an FTS, what with the index having rowids in it as well ... that probably makes the index twice the size of the table, and the optimiser weighs up scanning and sorting the table (multiblock reads) vs reading the index (single-block reads of twice the segment size). A win for the FTS even with a single-pass sort on the data.
David Aldridge
`@APC`: try forcing the `NESTED LOOPS` with `intervals` leading and `measures` driven: `/*+ LEADING (intervals) USE_NL(intervals, measures) */`
Quassnoi
Do you have an index on `measures.time`? Did you check the execution plan and make sure that `NESTED LOOPS` with `intervals` leading, `measures` driven and `INDEX RANGE SCAN` on the `measures` was used? If you alias the table in the query, you should use the aliases in the hint as well.
Quassnoi
+1  A: 

There may be a very efficient way of writing this query if the intervals are deterministic because the query could be converted to an equi-join that would be amenable to more efficient hash joining.

For example if the intervals are all hourly:

ENTRY_TIME          EXIT_TIME
2000-01-15 09:00:00 2000-01-15 09:59:59
2000-01-15 10:00:00 2000-01-15 10:59:59
2000-01-15 11:00:00 2000-01-15 11:59:59
2000-01-15 12:00:00 2000-01-15 12:59:59
....

Then the join can be written as:

intervals.entry_time=trunc(measures.time,'HH')

This would reduce the cost of everything up to and including the join pretty much to a full scan of each of the tables.
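Written out in full, the equi-join version of the query would look something like this (assuming the hourly intervals above):

select measures.measure     as measure,
       measures.time        as time,
       intervals.entry_time as entry_time,
       intervals.exit_time  as exit_time
from   intervals
inner join measures
  on   intervals.entry_time = trunc(measures.time, 'HH')
order  by time asc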

However, since you have the ORDER BY operation in there, I think that a sort-merge might still beat it as the query is written right now, because the optimiser will sort a smaller data set for the sort-merge than it would for the hash join (in the latter case it would have to sort more columns of data). You could get round this by structuring the query as:

select
  measures.measure     as measure,
  measures.time        as time,
  intervals.entry_time as entry_time,
  intervals.exit_time  as exit_time
from
  intervals inner join  
  (select time, measure from measures order by time) measures
  on  intervals.entry_time=trunc(measures.time,'HH')  
/

This gives a lower cost estimate than a sort-merge on my 10.2.0.4 test instance but I'd regard it as a little risky.

So, I'd look for a sort-merge or rewrite it to allow the use of a hash join if possible.

David Aldridge