views:

74

answers:

4

I have an insert that uses a condition checking for a NOT IN. There are about 230k rows in the NOT IN subquery.

INSERT INTO Validate.ItemError (ItemId, ErrorId, DateCreated) 
(
    SELECT ItemId, 10, GetUTCDate() 
    FROM Validate.Item 
    INNER JOIN Refresh.Company 
    ON Validate.Item.IMCompanyId = Refresh.Company.IMCompanyId 
    WHERE Refresh.Company.CompanyId = 14 
    AND 
    (
        IMAccountId IS NULL OR NOT IMAccountId IN
        (
            SELECT RA.IMAccountId 
            FROM Refresh.Account RA 
            INNER JOIN Refresh.BalancePool BP 
            ON RA.BalancePoolId = BP.BalancePoolId 
            WHERE BP.CompanyId = 14
        )
    )
)

When I run this as-is it takes about 30+ minutes (yikes!). The number of values in the Validate.Item table could be anywhere from 150 rows to over 200k, so you can see how this might be a pain.

There are indices on all the relevant fields in the tables, and none are excessively fragmented.

My first thought was to do this in pieces, and throw it into a WHILE loop:

DECLARE @StartId int, @EndId int, @MaxId int

SELECT @MaxId = MAX(AccountId) FROM Refresh.Account
SET @StartId = 1
SET @EndId = 1000

WHILE (@StartId < @MaxId)
BEGIN
    INSERT INTO Validate.ItemError (ItemId, ErrorId, DateCreated) 
    (
        SELECT ItemId, 10, GetUTCDate() 
        FROM Validate.Item 
        INNER JOIN Refresh.Company 
        ON Validate.Item.IMCompanyId = Refresh.Company.IMCompanyId 
        WHERE Refresh.Company.CompanyId = 14 
        AND 
        (
            IMAccountId IS NULL  
            OR NOT IMAccountId IN  
            (
                SELECT RA.IMAccountId 
                FROM Refresh.Account RA 
                INNER JOIN Refresh.BalancePool BP 
                ON RA.BalancePoolId = BP.BalancePoolId 
                WHERE BP.CompanyId = 14 
                AND RA.AccountId BETWEEN @StartId AND @EndId
            )
        )
    )
    SET @StartId = @StartId + 1000
    SET @EndId = @EndId + 1000
END

Doing it this way nets me a time of about a minute per loop; multiply that by 230 times and we have an even more ridiculous number.

Please tell me you guys have a better idea how to optimize this. Without this one query, the entire process only takes 8 seconds; it's just the sheer size of the Refresh.Account table that throws everything into chaos.

TIA!

Valkyrie

A: 

Does using NOT EXISTS help here?

(SELECT ItemId, 10, GetUTCDate() 
FROM Validate.Item INNER JOIN Refresh.Company ON 
Validate.Item.IMCompanyId = Refresh.Company.IMCompanyId 
WHERE Refresh.Company.CompanyId = 14 
AND (IMAccountId IS NULL  OR  NOT EXISTS (SELECT TOP 1 RA.IMAccountId FROM 
Refresh.Account RA INNER JOIN Refresh.BalancePool BP 
ON RA.BalancePoolId = BP.BalancePoolId WHERE BP.CompanyId = 14 AND 
RA.IMAcccountID = Validate.Item.IMAccountId)))

I am not sure, if the query is correct.

But, I am using NOT EXISTS along with TOP 1 inside the subquery.
Also, the subquery limits the record by adding an additional AND RA.IMAcccountID = Validate.Item.IMAccountId.

EDIT: I hope you get the idea of what I am trying to do.
Instead of checking it against all rows inside Refresh.Account, I am limiting the rows and trying to find atleast 1 matching row with matching IMAccountID - which should not be existing as per your original query (that uses NOT IN ...).

shahkalpesh
You were faster than me :)
Erlock
ALL of you rock! This took me to a sub 1-second query, and is EXACTLY what I needed. Thank you so much!
Valkyrie
+1  A: 

Use NOT EXISTS instead:

...OR  NOT EXISTS (SELECT 1 FROM 
Refresh.Account RA INNER JOIN Refresh.BalancePool BP 
ON RA.BalancePoolId = BP.BalancePoolId WHERE BP.CompanyId = 14 AND RA.IMAccountId = xxx.IMAccountId)))

The subquery following EXISTS will only return the first record satisfying criteria. (Remember to replace xxx with the alias of the right table)

Erlock
Virtually the same as shahkalpesh, but "TOP 1" is not required.
Philip Kelley
+1  A: 

Rather than doing a "not in", could you just do a left join to the relevant table and check for null keys? Not sure if the query is 100% correct:

INSERT INTO Validate.ItemError (ItemId, ErrorId, DateCreated) 
SELECT ItemId, 10, GetUTCDate() 
FROM Validate.Item 
INNER JOIN Refresh.Company ON Validate.Item.IMCompanyId = Refresh.Company.IMCompanyId 
LEFT JOIN Refresh.Account
  INNER JOIN Refresh.BalancePool BP ON BP.BalancePoolId = RA.BalancePoolId
ON Refresh.Account.IMAccountId = Validate.Item.IMAccountId 
WHERE Refresh.Company.CompanyId = 14 
AND Validate.Item.IMAccountId IS NULL OR Refresh.Account.IMAccountId IS NULL
Eric Petroelje
+2  A: 

Get rid of the OR condition.

It adds a fullscan and prevents the optimizer from using an ANTI JOIN which it would use otherwise.

This query returns the same:

SELECT  ItemId, 10, GetUTCDate() 
FROM    Validate.Item 
INNER JOIN
        Refresh.Company 
ON      Validate.Item.IMCompanyId = Refresh.Company.IMCompanyId 
WHERE   Refresh.Company.CompanyId = 14 
        AND NOT EXISTS
        (
        SELECT  RA.IMAccountId 
        FROM    Refresh.Account RA 
        INNER JOIN
                Refresh.BalancePool BP 
        ON      RA.BalancePoolId = BP.BalancePoolId 
        WHERE   BP.CompanyId = 14
                AND RA.IMAccounID = Validate.Item.IMAccountId
        )
Quassnoi