views:

322

answers:

5

I have a table called users with roughly 250,000 records in it. I have another table called staging with around 75,000 records in it. Staging only has one column, msisdn. I want to check to see how many rows in staging are not present in users.

I have the following query, which I have tested on a small data subset, and it seems to work fine:

SELECT
    s.*
    FROM staging s
        LEFT OUTER JOIN users u ON u.msisdn=s.msisdn
        WHERE u.msisdn IS NULL

The problem however, is when I try to run this query on the full list of 250k users. It ran for an hour before I stopped it. Is there any way I can optimise this query?

I have started running the query on subsets of the data in staging, but this is horribly manual:

SELECT
    s.*
    FROM staging s
        LEFT OUTER JOIN users u ON u.msisdn=s.msisdn
        WHERE u.msisdn IS NULL
    LIMIT 0,10000

msisdn is the primary key of the staging table, but it's not the primary key of the table users. I don't know if that is significant however.

+1  A: 

Put indexes on the msisdn columns of each table. Since it's not a PK on users, you'll need to put a non-clustered index on it. That should speed up your query tremendously.

Eric
A: 

I'm not sure how much quicker this will be, but you can try something like.

select msisdn
from staging
where msisdn not in (select msisdn from users)

Also, make sure that an index exists for the msisdn column in both tables. That should speed things up tremendously.

Michael Todd
+2  A: 

First, you can see what indices MySQL is using with the EXPLAIN command. Just write EXPLAIN in front of your query, and the results will show what index (if any) it's using. Presumably if it's that slow on so (relatively) small a data set as 250,000 records, it's not exploiting a very effective index, and you'll be able to see where.

It may also help to rewrite the query as a NOT EXISTS like so:

SELECT s.* FROM staging s
WHERE NOT EXISTS (SELECT 1 FROM users WHERE users.misdn = s.misdn)
VoteyDisciple
+1  A: 

Things you can do to speed up this query:

  • make sure msisdn is indexed in both tables
  • optimize your tables
  • replace * with msisdn
tharkun
A: 

Thanks for this, added an index to two of my fields and speeded my query up from 40 secs to sub 1 sec. Thanks you!

Mike Wilson