ansaurus

Question

A messy SQL statement

Answer 1

+3 A:

You could rewrite it differently as follows:

SELECT p.* 
FROM Properties p
LEFT JOIN Countries c ON p.Country_ID = c.CountryID
LEFT JOIN Regions r on p.RegionID = r.RegionID
LEFT JOIN Areas a on p.AreaID = a.AreaID
WHERE c.CountryID IS NULL
OR r.RegionID IS NULL
OR a.AreaID IS NULL

Test the performance difference (if there is any - there should be as NOT IN is a nasty search, especially over a lot of items as it HAS to test every single one).

You can also make this faster by indexing the IDS being searched - in each master table (Country, Region, Area) they should be clustered primary keys.

ck 2009-04-14 11:43:06

AFAIK, you would normally see little difference between clustered and non-clustered in this example.

Marc Gravell 2009-04-14 11:55:08

@Marc Gravell, indeed it would be almost negligable, I just like Clustered Indexes on most tables

ck 2009-04-14 12:01:24

Answer 2

+3 A:

Since this seems to be cleanup sql, this should be ok. But how about using foreign keys so that it does not bother you next time around?

Learning 2009-04-14 11:44:07

My thoughts exactly. The query will return the correct results and it doesn't matter if it is inefficient because you're going to use the results to clean up the data then add foreign keys.

Jamie Ide 2009-04-14 11:50:35

Answer 3

+1 A:

Well, you could try things like UNION (instead of OR) - but I expect that the optimizer is already doing the best it can given the information available:

SELECT  * FROM Properties
WHERE   NOT EXISTS (SELECT 1 FROM Areas WHERE Areas.AreaID = Properties.AreaID)
UNION
SELECT  * FROM Properties
WHERE   NOT EXISTS (SELECT 1 FROM Regions WHERE Regions.RegionID = Properties.RegionID)
UNION
SELECT  * FROM Properties
WHERE   NOT EXISTS (SELECT 1 FROM Countries WHERE Countries.CountryID = Properties.CountryID)

Marc Gravell 2009-04-14 11:44:36

Answer 4

A:

Subqueries in the conditions can be quite inefficient. Instead you can do left joins against the related tables. Where there are no matching record you get a null value. You can use this in the condition to select only the records where there is a matching record missing:

select p.*
from Properties p
left join Countries c on c.CountryID = p.Country_ID
left join Regions r on r.RegionID = p.RegionID
left join Areas a on a.AreaID = p.AreaID
where c.CountryID is null or r.RegionID is null or a.AreaID is null

Guffa 2009-04-14 11:46:28

Answer 5

+4 A:

Your query in fact is optimal.

LEFT JOIN's proposed by others are worse, as they select ALL values and then filter them out.

Most probably your subquery will be optimized to this:

SELECT  *
FROM    Properties p
WHERE   NOT EXISTS
        (
        SELECT  1
        FROM    Countries i
        WHERE   i.CountryID = p.CountryID
        )
        OR
        NOT EXISTS
        (
        SELECT  1
        FROM    Regions i
        WHERE   i.RegionID = p.RegionID
        )
        OR
        NOT EXISTS
        (
        SELECT  1
        FROM    Areas i
        WHERE   i.AreaID = p.AreaID
        )

, which you should use.

This query selects at most 1 row from each table, and jumps to the next iteration right as it finds this row (i. e. if it does not find a Country for a given Property, it will not even bother checking for a Region).

Again, SQL Server is smart enough to build the same plan for this query and your original one.

Update:

Tested on 512K rows in each table.

All corresponding ID's in dimension tables are CLUSTERED PRIMARY KEY's, all measure fields in Properties are indexed.

For each row in Property, PropertyID = CountryID = RegionID = AreaID, no actual missing rows (worst case in terms of execution time).

NOT EXISTS    00:11 (11 seconds)
LEFT JOIN     01:08 (68 seconds)

Quassnoi 2009-04-14 11:51:25

I like you solution in that it uses positive searching rather than negative, however I would be interested to see the performance differences between the approaches.

ck 2009-04-14 12:03:44

Answer 6

A:

If you're not grabbing the row data from countries/regions/areas you can try using "exists":

SELECT Properties.*
FROM Properties
WHERE Properties.CountryID IS NOT NULL AND NOT EXISTS (SELECT 1 FROM Countries WHERE Countries.CountryID = Properties.CountryID)
OR Properties.RegionID IS NOT NULL AND NOT EXISTS (SELECT 1 FROM Regions WHERE Regions.RegionID = Properties.RegionID)
OR Properties.AreaID IS NOT NULL AND NOT EXISTS (SELECT 1 FROM Areas WHERE Areas.AreaID = Properties.AreaID)

This will typically hint to use the pkey indices of countries et al for the existence check... but whether that is an improvement depends on your data stats, you simply have to plug it into query analyzer and try it.

araqnid 2009-04-14 11:52:17

ansaurus

tags:

views:

answers:

A messy SQL statement

related questions