Here is the situation. I have one table (t1 below) that contains records based on records in many different tables. t2 is one of the tables whose information is pulled into t1.

t1
  table_oid --which table id is a FK to
  id        --fk to other table
  store_num --field

t2
  t2_id

Here is what I need to find: I need the largest t2_id where the store_num is not null in the corresponding record of t1. Here is the query I wrote:

select max(id) from t1
join t2 on t2.t2_id = t1.id
where store_num is not null
and table_oid = 1234;

However, this takes fairly long. I think this should be a fast query: all of the _id columns have indexes (t1.id, t1.table_oid, t2.t2_id). The vast majority of entries in t1 have a store_num.

Mentally, I would get the t2_ids in desc order, then one by one, try them against t1 until I found the first one that had a store_num.
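That first-match idea can be written directly as a top-1 query. Below is a runnable sqlite3 sketch with made-up toy data (sqlite rather than Oracle, so the plan will differ; in Oracle the `LIMIT 1` would become `WHERE ROWNUM = 1` around an ordered inline view):

```python
import sqlite3

# Toy versions of the two tables, small enough to reason about.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (table_oid INTEGER, id INTEGER, store_num INTEGER);
    CREATE TABLE t2 (t2_id INTEGER PRIMARY KEY);
    INSERT INTO t2 (t2_id) VALUES (1), (2), (3), (4), (5);
    -- The largest matching id (5) has no store_num, so the answer is 4.
    INSERT INTO t1 VALUES (1234, 3, 70), (1234, 4, 71), (1234, 5, NULL);
""")

# Walk the t2_ids from the top down and stop at the first one whose
# t1 row has a store_num -- the "try them one by one" idea as a query.
row = conn.execute("""
    SELECT t2.t2_id
      FROM t2
      JOIN t1 ON t1.id = t2.t2_id
     WHERE t1.table_oid = 1234
       AND t1.store_num IS NOT NULL
     ORDER BY t2.t2_id DESC
     LIMIT 1
""").fetchone()
print(row[0])  # -> 4
```

Whether the database actually walks the index top-down or scans everything and sorts depends entirely on the optimizer, which is the crux of the question.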

select t2_id from t2 order by t2_id desc;

has an explain cost of 25612

select t1.* from t1 where table_oid = 1234
and id in (select max(t2_id) from t2);

has an explain cost of 8.

So why wouldn't the above query be a cost of at most 25612*8 = 204896? When I explain it, it comes back as more than 3 times that.

Really, my question is how do I re-write that query to run faster.

NOTE: I am using Oracle.

EDIT:

t2 has 11,895,731 rows
t1 has 473,235,192 rows

EDIT 2:

As I've tried different things, the part of the query that is taking the longest is the full scan on t1 looking for the store_num. Is there a way to keep this from doing a full scan, since I only need the biggest entry?

+2  A: 

Not sure if these apply to Oracle. Do you have an index on the FK id column for the join? Also, avoid NOT IN if you can; it's a non-sargable predicate in SQL Server, which slows a query down.

Another option (which might be slower) is doing an outer join and then checking for null on that column (again, not sure whether that only applies to SQL Server).

select max(id) from t1
left outer join t2 on t2.t2_id = t1.id
where t1... IS NULL
and table_oid = 1234;
sadboy
I tried this, and it ended up being the same cost.
David Oneill
ahhh sorry. What about the index?
sadboy
Yes, both have indexes: t1.id (in t1) and t2.t2_id.
David Oneill
In some cases, an unnecessary full-table scan can be forced to use an index by adding an index hint.
sadboy
What's the syntax for an index hint?
David Oneill
http://www.dba-oracle.com/t_oracle_index_hint_syntax.htm
sadboy
+1  A: 

The best way I can think of to have this run fast is to:

  1. Create an index on (TABLE_OID, ID DESC, COVERED_ENTITY_ID) in that order. Why?

     table_oid         -- this is your primary access condition
     id                -- so you don't have to access a data block to read it,
                       -- and you get higher ID values first
     covered_entity_id -- you're filtering the data based on this, null vs not null

That should prevent the need to access the 473m rows in T1 at all.

  2. Ensure that there's an index on T2_ID.

If all that's in place, a query like:

select max(id) 
  from t1     
       inner join t2 
          on t2.t2_id = t1.id     
 where covered_entity_id is not null     
   and table_oid = 1234;   

should be able (the optimizer is a finicky beast) to do a semi-join driven by a fast full scan against the index on T1, never touching the data blocks. Also consider writing it manually in either of these forms:

select max(id) 
  from t1     
 where covered_entity_id is not null     
   and table_oid = 1234
   and exists (select null 
                 from t2
                where t1.id = t2.t2_id);   

select max(id)
  from t1
 where covered_entity_id is not null
   and table_oid = 1234
   and id in (select t2_id from t2);

The optimizer may plan each of those forms slightly differently.
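To see the shape of the answer on toy data, here is a small sqlite3 sketch of the EXISTS form with the suggested composite index in place (sqlite's optimizer, not Oracle's, so this only checks the result, not the plan):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (table_oid INTEGER, id INTEGER, covered_entity_id INTEGER);
    CREATE TABLE t2 (t2_id INTEGER PRIMARY KEY);
    -- The suggested composite index, in the suggested column order.
    CREATE INDEX t1_cover ON t1 (table_oid, id DESC, covered_entity_id);
""")
conn.executemany("INSERT INTO t2 VALUES (?)", [(i,) for i in range(1, 101)])
# Every tenth row has a NULL covered_entity_id, including the top id (100).
conn.executemany("INSERT INTO t1 VALUES (1234, ?, ?)",
                 [(i, i if i % 10 else None) for i in range(1, 101)])

(result,) = conn.execute("""
    SELECT MAX(id)
      FROM t1
     WHERE covered_entity_id IS NOT NULL
       AND table_oid = 1234
       AND EXISTS (SELECT NULL FROM t2 WHERE t1.id = t2.t2_id)
""").fetchone()
print(result)  # -> 99, skipping id 100 whose covered_entity_id is NULL
```

The index contains every column the query touches, which is what lets the real database answer it without visiting the 473m-row table at all.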

Adam Musch
+1  A: 

In the following I assume covered_entity_id is the same as store_num - it would really make things easier for us if you were consistent in your naming.

The vast majority of entries in t1 have a store_num.

Given that this is the case, the following clause shouldn't have any impact on the performance of your query ...

where covered_entity_id is not null

However, you go on to say

the part of the query that is taking the longest is the full scan on t1 looking for the store_num

This suggests the query is looking for covered_entity_id is not null first rather than the presumably far more selective table_oid = 1234. The solution might be as simple as re-writing the query like this ...

where table_oid = 1234 
and  covered_entity_id is not null;

... although I suspect not. You could try hinting to get the query to use the index on table_oid.

The other thing is, how fresh are the statistics? When the optimizer chooses a radically bad execution plan it is often because the stats are out of date.

Incidentally, why are you joining to T2 at all? Your requirements could be met by selecting max(id) from T1 (unless you don't have a foreign key enforcing T1.ID references T2.T2_ID, and hence need to be sure).

edit

To check your statistics run this query:

select table_name
       , num_rows
       , last_analyzed
from user_tables
where table_name in ('T1', 'T2')
/

If the results show num_rows is widely divergent from the values you gave in your first edit then you should re-gather statistics. If last_analyzed is something like the day you went live then you definitely should re-gather. You may want to export your statistics first; refreshing the statistics can affect the execution plans (that is the object of the exercise), usually for good, but sometimes things can get worse. Find out more.
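The same kind of staleness check can be mimicked end to end in a runnable sqlite3 sketch (sqlite's ANALYZE populates sqlite_stat1, a rough analogue of num_rows in user_tables):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (id INTEGER)")
conn.execute("CREATE INDEX t1_id ON t1 (id)")
conn.executemany("INSERT INTO t1 VALUES (?)", [(i,) for i in range(1000)])

conn.execute("ANALYZE")  # gather optimizer statistics
# The first number in sqlite_stat1.stat is the estimated row count.
stat = conn.execute(
    "SELECT stat FROM sqlite_stat1 WHERE tbl = 't1'").fetchone()[0]
print(stat.split()[0])  # -> 1000
```

In Oracle the re-gather itself is done through DBMS_STATS, e.g. `EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'T1')`.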

APC
Yes, store_num and covered_entity_id are the same. How do I find out how fresh the statistics are? (And how do I refresh them?)
David Oneill
+1 I started writing up an explanation why I was joining to T2 at all, and I realized that the usual reason I do it doesn't apply in this case. Thx for noticing that
David Oneill
+1  A: 

You say:

all _ids have indexes for them

But your query is:

...
where store_num is not null
and table_oid = 1234;

All of your _id indexes are useless for this query unless store_num and table_oid are also indexed, and are the first columns in said index.

So of course it has to do a full scan; it can give you back max(id) instantly without any filter conditions, but as soon as you put in the filter, it can't use the id index anymore because it doesn't know which part of the index matches those store_num is not null entries - not without a scan.

To speed the query up, you need to create an index on (store_num, table_oid, id). Standard disclaimers about creating indexes for a single ad-hoc query apply; having too many indexes will hurt insert/update performance.
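The effect of such an index on the plan can be watched on toy data with sqlite3 (sqlite's planner, not Oracle's, so the plan text is only illustrative; the point is that the filter columns must lead the index):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (store_num INTEGER, table_oid INTEGER, id INTEGER)")
# Odd ids get a store_num, even ids are NULL, so the answer should be 9.
conn.executemany("INSERT INTO t1 VALUES (?, ?, ?)",
                 [(i if i % 2 else None, 1234, i) for i in range(1, 11)])

query = ("SELECT MAX(id) FROM t1 "
         "WHERE store_num IS NOT NULL AND table_oid = 1234")

def plan(sql):
    # EXPLAIN QUERY PLAN rows carry their description in the fourth column.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

print(plan(query))                        # no index yet: a full table scan
conn.execute("CREATE INDEX t1_filter ON t1 (store_num, table_oid, id)")
print(plan(query))                        # the planner can now use the index
print(conn.execute(query).fetchone()[0])  # -> 9
```

Before the index the only option is a scan of t1; afterwards the filter predicates can be satisfied from the index alone.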

It really doesn't matter how you "rewrite" your query; this isn't like application code, and the optimizer is going to rearrange all of the pieces of your query anyway. Unless you have sufficiently selective indexes on your seek columns, or the entire query is covered by a single index, it's going to be slow.

Aaronaught