tags:

views:

62

answers:

4

I have a table structure similar the following:

create table MAIL (
  ID        int,
  FROM      varchar,
  SENT_DATE date
);

create table MAIL_TO (
  ID      int,
  MAIL_ID int,
  NAME      varchar
);

and I need to run the following query:

select m.ID 
from MAIL m 
  inner join MAIL_TO t on t.MAIL_ID = m.ID
where m.SENT_DATE between '07/01/2010' and '07/30/2010'
  and t.NAME = '[email protected]'

Is there any way to design indexes such that both of the conditions can use an index? If I put an index on MAIL.SENT_DATE and an index on MAIL_TO.NAME, the database will choose to use either one of the indexes or the other, not both. After filtering by the first condition the database always has to do a full scan of the results for the second condition.

+2  A: 

A materialized view would allow you to index the values, assuming the stringent materialized view criteria is met.

OMG Ponies
I assume that the application will need to be modified to refer to the materialized view instead?
jthg
@jthg: Yes, it's a view - just faster/more efficient than an ordinary view.
OMG Ponies
A: 

Which criterion is more selective? The date range or the addressee? I would guess the addressee. And if that is highly selective, do not care for the date index, just let the database do the search based on the found mail ids. But index table MAIL on the id if it is not already.

On the other hand, some modern optimizers would even make use of both indexes, scanning both tables and than build a hash value of the join columns to merge the results of both. I am not absolutely sure if and when Oracle would chose this strategy. I just realized that SQL Server tends to make hash joins rather often, compared to other engines.

Frank
Thanks for the info. I can't see though how doing a hash join would make the select faster. It will still need to do a full scan of the results from each condition.
jthg
I was assuming the optimizer would use both indexes to restrict the two inputs to the hash join. Of course, records with the correct addressee and the wrong date and vice versa would still be in the inputs of these, but not any records for other addressees from MAIL_TO or out of date range from MAIL.
Frank
+4  A: 

Oracle can use both indices. You just don't have the right two indices.

Consider: if the query plan uses your index on mail.sent_date first, what does it get from mail? It gets all the mail.ids where mail.sent_date is within the range you gave in your where clause, yes?

So it goes to mail_to with a list of mail.ids and the mail.name you gave in your where clause. At this point, Oracle decides that it's better to scan the table for matching mail_to.mail_ids rather than use the index on mail_to.name.

Indices on varchars are always problematic, and Oracle really prefers full table scans. But if we give Oracle an index containing the columns it really wants to use, and depending on total table rows and statistics, we can get it to use it. This is the index:

 create index mail_to_pid_name on mail_to( mail_id, name ) ; 

This works where an index just on name doesn't, because Oracle's not looking just for a name, but for a mail_id and a name.

Conversely, if the cost-based analyzer determines it's cheaper to go to table mail_to first, and uses your index on mail_to.name, what doe sit get? A bunch of mail_to_.mail_ids to look up in mail. It needs to find rows with those ids and certain sent_dates, so:

 create index mail_id_sentdate on mail( sent_date, id ) ; 

Note that in this case I've put sent_date first in the index, and id second. (This is more an intuitive thing.)

Again, the take home point is this: in creating indices, you have to consider not just the columns in your where clause, but also the columns in your join conditions.


Update

jthg: yes, it always depends on how the data is distributed. And on how many rows are in the table: if very many, Oracle will do a table scan and hash join, if very few it will do a table scan. You might reverse the order of either of the two indices. By putting sent_date first in the second index, we eliminate most needs for an index solely on sent_date.

tpdi
Very well explained.
Marcus Adams
Thanks, but wouldn't the two indexes that you suggested be not very helpful? The first index would still force the database to try every MAIL_ID within the sent_date range, it just saves the time of scanning the few MAIL_TO's associated with each MAIL. The second index is no different from an index on just sent_date since no two sent_dates are going to be exactly the same.
jthg
@jthg: Oracle, as most modern engines, can make use of columns in an index and not in a table by avoiding to access the table at all:If it scans an index and finds rowIds, and then has to go back to the disk to fetch the table pages containing these rowIds, that is a lot of effort, especially if the pages are widely spread on the disk. If the second column searched is part of the index, this additional disk access is avoided. Add to that the smaller record size of indexes (normally much fewer columns), and hence the data is spread over less pages, and hence there is less I/O this way.
Frank
I see what you mean. The optimizations you suggested should provide a constant factor speed up in the queries - which is definitely useful in many situations. I was hoping for an improvement in the big O performance of the queries though and I think OMG Ponies suggestion of using a materialized view would provide that.
jthg
A: 

If your queries are generally for a particular month, then you could partition the data by month.

Marcus Adams
Unfortunately, that's not the case. The range is arbitrary.
jthg