views:

37

answers:

1

Sorry if this is too simple, but thanks in advance for helping. This is for MySQL but might be relevant for other RDMBSs

tblA has 4 columns: colA, colB, colC, mydata, A_id It has about 10^9 records, with 10^3 distinct values for colA, colB, colC.

tblB has 3 columns: colA, colB, B_id It has about 10^4 records.

I want all the records from tblA (except the A_id) that have a match in tblB. In other words, I want to use tblB to describe the subset that I want to extract and then extract those records from tblA. Namely:

SELECT a.colA, a.colB, a.colC, a.mydata 
FROM tblA as a
INNER JOIN tblB as b
ON 
  a.colA=b.colA
  a.colB=b.colB
;

It's taking a really long time (more than an hour) on a newish computer (4GB, Core2Quad, ubuntu), and I just want to check my understanding of the following optimization steps.

** Suppose this is the only query I will ever run on these tables. So ignore the need to run other queries.

Now my questions: 1) What indexes should I create to optimize this query? I think I just need a multiple index on (colA, colB) for both tables. I don't think I need separate indexes for colA and colB. Another stack overflow article (that I can't find) mentioned that when adding new indexes, it is slower when there are existing indexes, so that might be a reason to use the multiple index.

2) Is INNER JOIN correct? I just want results where a match is found.

3) Is it faster if I join (tblA to tblB) or the other way around, (tblB to tblA)? This previous answer says that the optimizer should take care of that.

4) Does the order of the part after ON matter? This previous answer say that the optimizer also takes care of the execution order.

+1  A: 

1) What indexes should I create to optimize this query?

Indexes can help, even if the columns are only referenced in the SELECT clause. So I could create a covering index on TABLE_A, with columns in the following order:

  1. colA
  2. colB
  3. colC
  4. mydata

For TABLE_B:

  1. colA
  2. colB

Order is important, because indexes are processed from left to right to satisfy the criteria for the index to be used. IE: colC wouldn't trigger the index without colA and colB being in the query, and colB wouldn't trigger index use without a reference to colA, and so on... But colC doesn't need to be referenced to trigger the index use either.

2) Is INNER JOIN correct? I just want results where a match is found.

Yes, that is correct for your requirements.

3) Is it faster if I join (tblA to tblB) or the other way around, (tblB to tblA)?

Test and compare.

4) Does the order of the part after ON matter?

No, it doesn't matter.

OMG Ponies