views:

225

answers:

6

I have a case where I wanna choose any database entry that have an invalid Country, Region, or Area ID, by invalid, I mean an ID for a country or region or area that no longer exists in my tables, I have four tables: Properties, Countries, Regions, Areas. I was thinking to do it like this:

SELECT * FROM Properties WHERE 
Country_ID NOT IN 
(
SELECT CountryID FROM Countries
)
OR
RegionID NOT IN
(
SELECT RegionID FROM Regions
)
OR
AreaID NOT IN
(
SELECT AreaID FROM Areas
)

Now, is my query right? and what do you suggest that i can do and achieve the same result with better performance?!

+3  A: 

You could rewrite it differently as follows:

SELECT p.* 
FROM Properties p
LEFT JOIN Countries c ON p.Country_ID = c.CountryID
LEFT JOIN Regions r on p.RegionID = r.RegionID
LEFT JOIN Areas a on p.AreaID = a.AreaID
WHERE c.CountryID IS NULL
OR r.RegionID IS NULL
OR a.AreaID IS NULL

Test the performance difference (if there is any - there should be as NOT IN is a nasty search, especially over a lot of items as it HAS to test every single one).

You can also make this faster by indexing the IDS being searched - in each master table (Country, Region, Area) they should be clustered primary keys.

ck
AFAIK, you would normally see little difference between clustered and non-clustered in this example.
Marc Gravell
@Marc Gravell, indeed it would be almost negligable, I just like Clustered Indexes on most tables
ck
+3  A: 

Since this seems to be cleanup sql, this should be ok. But how about using foreign keys so that it does not bother you next time around?

Learning
My thoughts exactly. The query will return the correct results and it doesn't matter if it is inefficient because you're going to use the results to clean up the data then add foreign keys.
Jamie Ide
+1  A: 

Well, you could try things like UNION (instead of OR) - but I expect that the optimizer is already doing the best it can given the information available:

SELECT  * FROM Properties
WHERE   NOT EXISTS (SELECT 1 FROM Areas WHERE Areas.AreaID = Properties.AreaID)
UNION
SELECT  * FROM Properties
WHERE   NOT EXISTS (SELECT 1 FROM Regions WHERE Regions.RegionID = Properties.RegionID)
UNION
SELECT  * FROM Properties
WHERE   NOT EXISTS (SELECT 1 FROM Countries WHERE Countries.CountryID = Properties.CountryID)
Marc Gravell
A: 

Subqueries in the conditions can be quite inefficient. Instead you can do left joins against the related tables. Where there are no matching record you get a null value. You can use this in the condition to select only the records where there is a matching record missing:

select p.*
from Properties p
left join Countries c on c.CountryID = p.Country_ID
left join Regions r on r.RegionID = p.RegionID
left join Areas a on a.AreaID = p.AreaID
where c.CountryID is null or r.RegionID is null or a.AreaID is null
Guffa
+4  A: 

Your query in fact is optimal.

LEFT JOIN's proposed by others are worse, as they select ALL values and then filter them out.

Most probably your subquery will be optimized to this:

SELECT  *
FROM    Properties p
WHERE   NOT EXISTS
        (
        SELECT  1
        FROM    Countries i
        WHERE   i.CountryID = p.CountryID
        )
        OR
        NOT EXISTS
        (
        SELECT  1
        FROM    Regions i
        WHERE   i.RegionID = p.RegionID
        )
        OR
        NOT EXISTS
        (
        SELECT  1
        FROM    Areas i
        WHERE   i.AreaID = p.AreaID
        )

, which you should use.

This query selects at most 1 row from each table, and jumps to the next iteration right as it finds this row (i. e. if it does not find a Country for a given Property, it will not even bother checking for a Region).

Again, SQL Server is smart enough to build the same plan for this query and your original one.

Update:

Tested on 512K rows in each table.

All corresponding ID's in dimension tables are CLUSTERED PRIMARY KEY's, all measure fields in Properties are indexed.

For each row in Property, PropertyID = CountryID = RegionID = AreaID, no actual missing rows (worst case in terms of execution time).

NOT EXISTS    00:11 (11 seconds)
LEFT JOIN     01:08 (68 seconds)
Quassnoi
I like you solution in that it uses positive searching rather than negative, however I would be interested to see the performance differences between the approaches.
ck
A: 

If you're not grabbing the row data from countries/regions/areas you can try using "exists":

SELECT Properties.*
FROM Properties
WHERE Properties.CountryID IS NOT NULL AND NOT EXISTS (SELECT 1 FROM Countries WHERE Countries.CountryID = Properties.CountryID)
OR Properties.RegionID IS NOT NULL AND NOT EXISTS (SELECT 1 FROM Regions WHERE Regions.RegionID = Properties.RegionID)
OR Properties.AreaID IS NOT NULL AND NOT EXISTS (SELECT 1 FROM Areas WHERE Areas.AreaID = Properties.AreaID)

This will typically hint to use the pkey indices of countries et al for the existence check... but whether that is an improvement depends on your data stats, you simply have to plug it into query analyzer and try it.

araqnid