I have a table called users with roughly 250,000 records, and another table called staging with around 75,000 records. staging has only one column, msisdn. I want to find out how many rows in staging are not present in users.
I have the following query, which I have tested on a small data subset, and it seems to work fine:
SELECT
s.*
FROM staging s
LEFT OUTER JOIN users u ON u.msisdn=s.msisdn
WHERE u.msisdn IS NULL
The problem, however, is that the query does not finish when I run it against the full users table of 250k rows. It ran for an hour before I stopped it. Is there any way I can optimise this query?
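Since all I actually need is the number of missing rows rather than the rows themselves, a count-only form of the same anti-join would do. This is just a sketch of what I mean; the alias missing_count is a name I made up for illustration, and I have not confirmed that it runs any faster:

```sql
-- Count staging rows that have no matching msisdn in users.
-- Same LEFT JOIN ... IS NULL anti-join as above, returning one number
-- instead of the rows. missing_count is a hypothetical alias.
SELECT COUNT(*) AS missing_count
FROM staging s
LEFT OUTER JOIN users u ON u.msisdn = s.msisdn
WHERE u.msisdn IS NULL;
```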
I have started running the query on subsets of the data in staging, but this is horribly manual:
SELECT
s.*
FROM staging s
LEFT OUTER JOIN users u ON u.msisdn=s.msisdn
WHERE u.msisdn IS NULL
LIMIT 0,10000
msisdn is the primary key of the staging table, but it is not the primary key of the users table. I don't know whether that is significant.
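In case it helps with diagnosing this, here is a sketch of how I could check whether users.msisdn is indexed at all, assuming this is MySQL (which the LIMIT 0,10000 syntax implies). The index name idx_users_msisdn is hypothetical, and I am only guessing that a missing index is the cause:

```sql
-- Show the execution plan for the slow query. If the row for users shows
-- type = ALL and key = NULL, the join is scanning the whole users table
-- for every staging row.
EXPLAIN
SELECT s.*
FROM staging s
LEFT OUTER JOIN users u ON u.msisdn = s.msisdn
WHERE u.msisdn IS NULL;

-- List the indexes that currently exist on users.
SHOW INDEX FROM users;

-- Assumption: no existing index covers users.msisdn.
-- idx_users_msisdn is a hypothetical name chosen for illustration.
CREATE INDEX idx_users_msisdn ON users (msisdn);
```

If the column types or collations of msisdn differ between the two tables, I assume that could also stop an index from being used, so that may be worth comparing too.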