views:

315

answers:

4

Hey, so let's say i'm building this contact management system. There is a USER table and a CONTACT_INFO table. For every USER, I can have zero or more CONTACT_INFO records. The way I've defined it, I've setup a foreign key in my CONTACT_INFO table to point to the relevant USER record.

I would like do a search for all the USER records that do not have CONTACT_INFO records.

I expect that this could be done:

SELECT * FROM user u WHERE u.user_id NOT IN (SELECT DISTINCT c.user_id FROM CONTACT_INFO);

My concern is that as the tables grow, this query's performance can degrade significantly.

One idea I'm playing with is to add a column in the USER table that says if it has any CONTACT_INFO records or not. Also, I was wondering, if upon inserting any record into CONTACT_INFO, the DBMS has to verify that the record exists, it would already be accessing that record for the sake of verification and so updating it, when I update a CONTACT_INFO record should not be that costly, performance-wise.

As always, feedback is appreciated.

+3  A: 

Easiest way is:

SELECT (...) 
FROM user u
LEFT OUTER JOIN CONTACT_INFO c
ON u.user_id = c.user_id
WHERE c.user_id IS NULL

It looks more unwieldy, but should scale better.

BradC
The optimizers I've tested treat this identically to a simple NOT EXISTS. (But it's a great technique for lots of cases anyway, and slightly more portable, which means it works in old versions of MySQL.)
le dorfier
+1  A: 

Do you have any reason to think performance will degrade? This is one of the most efficient query types in SQL. But drop the DISTINCT.

le dorfier
why should I drop the DISTINCT?
Irwin
Because it might actually compute the aggregate list of values. All it really needs to do is look up on the index and see if there's any value at all - it just needs to find the first one.
le dorfier
+2  A: 

From my tests, the following is faster than BradC's method:

select (...)
from user u
where not exists (select null from CONTACT_INFO c where u.user_id = c.user_id)

This may be because the compiler does have to do the conversion itself, I don't know.

Le Dorfier is correct in principle, though: if you have set up your indexes right on the database (i.e. both user_id columns should be indexed), both your answer and most of these responses here will be extremely fast, regardless of how many records you have in your database.

Incidentally, if you're looking for a way to get a query that lists users along with a "HasContactInfo" boolean value, you could do something like this:

select u.(...), 
  (case when exists (select null from CONTACT_INFO c where c.user_id = u.user_id) then 1
        else null
        end) has_contact_info
from user u

This second solution may not be useful in your case, but I found it to be much faster than some simpler queries that I had assumed would get optimized automatically.

StriplingWarrior
+1  A: 

At least in oracle, I obtain better performance using

where 0 = (select count(*) from CONTACT_INFO c where...)

instead of NOT IN clause.

Jonathan