tags:

views:

45

answers:

3

The query below selects the rows from table_1 which do not exist in table_2 based on values in the 4 columns. The mismatch could be due to one or more columns.

I would like to extend the query to tell me which column(s) had the mismatched values, either by showing the column name or its value. I can do this in a cursor but prefer to do it in a set based operation if possible.

SELECT  i.agent ,
        i.agency ,
        i.customer ,
        i.Company
FROM    table_1 AS i
WHERE   NOT EXISTS ( SELECT p.agent ,
                            p.agency ,
                            p.customer ,
                            p.Company
                     FROM   table_2 AS p
                     WHERE  i.Agent = p.Agent
                            AND i.agency = p.Agency
                            AND i.customer = p.customer
                            AND i.Company = p.Company )

Update: I guess this needs more refinement. Let's add that 3 out of 4 columns need to match.

A: 

Is this well-defined? if [1,2,3,4] and [5,2,3,5] exist in p, [1,2,3,5] in i is a mismatch against [1,2,3,4] by column d and a mismatch against [5,2,3,5] by column a. Or would you claim that it is a mismatch due to both d and a?

Jimmy
+1  A: 

As others are saying, this problem needs to be defined a little bit more.

Lets say you have these objects in table 1:

  1. Big, red, sweet apple
  2. Small, green, tart apple
  3. Medium, orange, sweet peach
  4. Medium, orange, sour tangerine

And these objects in table 2:

  1. Small, red, sweet plum
  2. Big, red, sweet apple
  3. Small, green, sour apple
  4. Small, orange, sweet tangerine

Now, let's go through table 1:

  1. Big, red, sweet apple matches item 2 in table 2
  2. Small, green, tart apple has no match - which column(s) mismatch?
    • table 2, item 1 - mismatches on size, color, taste, and fruit
    • table 2, item 1 - mismatches on size, color, and taste
    • ...
    • ...
  3. (also has no match)
    • ...
    • ...
    • ...
    • ...
  4. (also has no match)
    • ...
    • ...
    • ...
    • ...

As you can see, this becomes an exponential problem very quickly. With just four items in each list, you have 12 mismatched pairs to list columns for. If you had 100,000 items and 10,000 were mismatches, you'd have to list 100 million pairs.

If you redefine your query so that it lists only items that differ on one attribute (column), then it might be a little more doable. Something like this:

-- warning untested code --

         SELECT 'agent' AS MismatchedColumn ,
                p.agent AS MismatchedValue ,
                i.agent ,
                i.agency ,
                i.customer , 
                i.company
           FROM table_1 AS i
LEFT OUTER JOIN table_2 AS p
             ON i.agent != p.agent
                AND i.agency = p.agency
                AND i.customer = p.customer
                AND i.company = p.company
          UNION
         SELECT 'agency' AS MismatchedColumn ,
                p.agency AS MismatchedValue ,
                i.agent ,
                i.agency ,
                i.customer ,
                i.company
           FROM table_1 AS i
LEFT OUTER JOIN table_2 AS p
             ON i.agent = p.agent
                AND i.agency != p.agency
                AND i.customer = p.customer
                AND i.company = p.company
          UNION
         SELECT 'customer' AS MismatchedColumn ,
                p.customer AS MismatchedValue ,
                i.agent ,
                i.agency ,
                i.customer ,
                i.company
           FROM table_1 AS i
LEFT OUTER JOIN table_2 AS p
             ON i.agent = p.agent
                AND i.agency = p.agency
                AND i.customer != p.customer
                AND i.company = p.company
          UNION
         SELECT 'company' AS MismatchedColumn ,
                p.company AS MismatchedValue ,
                i.agent ,
                i.agency ,
                i.customer ,
                i.company
           FROM table_1 AS i
LEFT OUTER JOIN table_2 AS p
             ON i.agent = p.agent
                AND i.agency = p.agency
                AND i.customer = p.customer
                AND i.company != p.company

I believe this will list all the records in table 1 that match on all but one column in table 2.

DanM
I added another requirement in my question.
Tony_Henrich
I guessed right :)
DanM
Good answer given the circumstances... of course if someone decides that they want to retrieve records matching only 2 columns, you're hosed. ;)
Aaronaught
True, would be a lot more UNION statements :) I'm looking at your answer now.
DanM
+2  A: 

You can simplify this problem drastically if you require that certain columns do match, or at least start with some expectations of which columns should match. In other words, instead of looking at this as a non-matching problem, redefine it as a partial matching problem.

Let's say you expect agent and agency to match, but customer and company might not. This isn't too difficult:

SELECT
    i.agent, i.agency, i.customer, i.company, p.customer, p.company,
    CASE
        WHEN i.customer = p.Customer THEN 'Y'
        ELSE 'N'
    END AS matchescustomer,
    CASE
        WHEN i.company = p.Company THEN 'Y'
        ELSE 'N'
    END AS matchescompany
FROM table1 i
INNER JOIN table2 p
    ON p.agent = i.agent
    AND p.agency = i.agency

If you want to check for other partial matches, just reorder the columns. Instead of joining on agent and agency, join on agent and customer, or whatever.

If you only expect a few different kinds of partial matches, you can write a few different queries similar to the one above and put them together with a UNION (or UNION ALL if you don't mind duplicates). In other words:

SELECT (columns)
FROM table1 i INNER JOIN table2 p
    ON p.agent = i.agent AND p.agency = i.agency
UNION
SELECT (columns)
FROM table1 i INNER JOIN table2 p
    ON p.agent = i.agent AND p.customer = i.customer

Now if you're looking to get every conceivable mismatch, then this is quickly going to get out of control, so you might want to adapt a more heuristic method, and search for partial matches that match at least a certain number of columns (say 3). Then you can restrict the obnoxiousness to at most the number of columns you have to compare:

;WITH PartialMatches_CTE AS
(
    SELECT i.agent AS iagent, p.agent AS pagent, ... (etc.)
    FROM table1 i INNER JOIN table2 p ON p.agent = i.agent
    UNION ALL
    SELECT (...) FROM table1 INNER JOIN table2 ON p.agency = i.agency
    UNION ALL
    SELECT (...) FROM table1 INNER JOIN table2 ON p.company = i.company
    ... and so on
),
ResolvedMatches_CTE AS
(
    SELECT DISTINCT
        iagent, pagent, iagency, pagency, ...,
        CASE WHEN pagent = iagent THEN 'Y' ELSE 'N' END AS agentmatch,
        CASE WHEN pagency = iagency THEN 'Y' ELSE 'N' END AS agencymatch,
        ...,
        (CASE WHEN pagent = iagent THEN 1 ELSE 0 END +
         CASE WHEN pagency = iagency THEN 1 ELSE 0 END +
         ...) AS MatchCount
    FROM PartialMatches_CTE
)
SELECT *
FROM ResolvedMatches_CTE
WHERE MatchCount >= 3

Now, having said all this, there's one thing I'm wondering...

These two data tables aren't, perchance, sequentially related, are they? As in, the 3rd row in table1 always maps to the 3rd row in table 2, but may not match all columns? It's a shot in the dark, but if that is indeed the case, then we could make this way simpler. Otherwise, the last query here should probably do what you want without becoming too much of an unmaintainable mess.

Note that performance is likely going to stink for all of these queries. Hopefully your data sets aren't too large. AFAIK there is no easy way to really optimize this kind of thing.

Aaronaught
+1 That is definitely more flexible than my solution. A little harder to wrap my head around though. I guess `PartialMatches` lists out anything that matches on one or more column, then `ResolvedMatches` says whether each column is a match (Y or N) and also counts the number of matches. The `SELECT DISTINCT` gets rid of all the duplicates. One question: what would happen if you did `SELECT DISTINCT` on the first CTE instead of the second?
DanM
@DanM: The CTEs aren't strictly necessary, they're just there for readability. The first CTE gets the list of partial matches (1 or more columns), the second classifies them, and the final query filters them. Using `DISTINCT` within the first CTE wouldn't help you, but you'd get the same result if you took out the `DISTINCT` in `Resolved` and changed each `UNION ALL` to `UNION` in `PartialMatches_CTE`. My gut feeling is that the version I wrote will be faster, but they'll both be pretty darn slow.
Aaronaught
Also, the first set of `CASE` statements in `ResolvedMatches_CTE` aren't strictly necessary either if all you want out of this is the non-matching rows. They're just there to help quickly identify to the reader which columns don't match, so that they don't have to stare at 8 or more columns and compare visually.
Aaronaught