Sorry if this is too simple, but thanks in advance for helping. This is for MySQL but might be relevant for other RDMBSs
tblA has 4 columns: colA, colB, colC, mydata, A_id It has about 10^9 records, with 10^3 distinct values for colA, colB, colC.
tblB has 3 columns: colA, colB, B_id It has about 10^4 records.
I want all the records from tblA (except the A_id) that have a match in tblB. In other words, I want to use tblB to describe the subset that I want to extract and then extract those records from tblA. Namely:
SELECT a.colA, a.colB, a.colC, a.mydata
FROM tblA as a
INNER JOIN tblB as b
ON
a.colA=b.colA
a.colB=b.colB
;
It's taking a really long time (more than an hour) on a newish computer (4GB, Core2Quad, ubuntu), and I just want to check my understanding of the following optimization steps.
** Suppose this is the only query I will ever run on these tables. So ignore the need to run other queries.
Now my questions: 1) What indexes should I create to optimize this query? I think I just need a multiple index on (colA, colB) for both tables. I don't think I need separate indexes for colA and colB. Another stack overflow article (that I can't find) mentioned that when adding new indexes, it is slower when there are existing indexes, so that might be a reason to use the multiple index.
2) Is INNER JOIN correct? I just want results where a match is found.
3) Is it faster if I join (tblA to tblB) or the other way around, (tblB to tblA)? This previous answer says that the optimizer should take care of that.
4) Does the order of the part after ON matter? This previous answer say that the optimizer also takes care of the execution order.