views:

31

answers:

2

I work with data that comes from multiple sources that I have no control over. These sources tend to have duplicates in the "key" values. I need to keep any of these duplicate values form matching in a join.

Using the following data

T1
| ID | FirstKey | SecondKey | ThirdKey | AdditionalColumns   |
+----+----------+-----------+----------+---------------------+
| 01 | Prod1    | ABC1      | 201      | Jun 2010, A, 101    |
| 02 | Prod2    | DEF2      | 202      | May 2009, A, 101    |
| 03 | Prod2    | DEF2      | 202      | May 2010, S, 101    |
| 04 | Prod3    |           | 206      | Jun 2010, A, 103    |
| 05 | Prod4    |           | 207      | Jun 2011, S, 103    |


T2
| ID | FirstKey | SecondKey | ThirdKey | AdditionalColumns   |
+----+----------+-----------+----------+---------------------+
| 01 | Prod1    | ABC1      | 201      | Jun 2010, A, 101    |
| 02 | Prod2    | DEF2      |          | May 2009, A, 101    |
| 03 | Prod2    | DEF2      | 202      | May 2010, S, 101    |
| 04 | Prod3    |           |          | Jun 2010, A, 103    |
| 05 | Prod4    |           | 207      | Jun 2011, S, 103    |
| 06 | Prod1    | ABC1      | 201      | Jun 2010, T, 101    |

Now if we do the query:

SELECT 
       T1.FirstKey, T1.SecondKey, T1.ThirdKey,
       T2.FirstKey, T2.SecondKey, T2.ThirdKey,
       T1.AdditionalColumns, T2.AdditionalColumns
FROM 
       T1 JOIN T2 ON T1.FirstKey = T2.FirstKey 
          AND T1.SecondKey = T2.SecondKey
          AND T1.SecondKey IS NOT NULL
UNION
SELECT 
       T1.FirstKey, T1.SecondKey, T1.ThirdKey,
       T2.FirstKey, T2.SecondKey, T2.ThirdKey, 
       T1.AdditionalColumns, T2.AdditionalColumns
FROM 
       T1 JOIN T2 ON T1.FirstKey = T2.FirstKey 
          AND T1.ThirdKey = T2.ThirdKey
          AND T1.SecondKey IS NULL

We get the following Results

FirstKey  SecondKey  ThirdKey  FirstKey  SecondKey  ThirdKey  AdditionalColumns  AdditionalColumns
--------  ---------  --------  --------  ---------  --------  -----------------  -----------------
Prod1     ABC1       201       Prod1     ABC1       201       Jun 2010, A, 101   Jun 2010, A, 101
Prod1     ABC1       201       Prod1     ABC1       201       Jun 2010, A, 101   Jun 2010, T, 101
Prod2     DEF2       202       Prod2     DEF2       202       May 2009, A, 101   May 2010, S, 101
Prod2     DEF2       202       Prod2     DEF2       202       May 2010, S, 101   May 2010, S, 101
Prod4     NULL       207       Prod4     NULL       207       Jun 2011, S, 103   Jun 2011, A, 103

I need the query to only return the records with an authoritative match. e.g. Only 1 match between the tables.

FirstKey  SecondKey  ThirdKey  FirstKey  SecondKey  ThirdKey  AdditionalColumns  AdditionalColumns
--------  ---------  --------  --------  ---------  --------  -----------------  -----------------
Prod4     NULL       207       Prod4     NULL       207       Jun 2011, S, 103   Jun 2011, A, 103

Is there a way to do this in the JOIN?

Currently I can get uniques by making CTE for each table that guarantees uniqueness on the keys used in the join. This works, but is ugly and adds significant work to the query.

Is there another way to do this join that will exclude the duplicate matches? This assumes I cannot programatically exclude any of the duplicate rows based on the AdditionalColumns data.

I run into this over and over so the CTE method seems just kludgey as it must be a problem that's been solved.

+1  A: 

How about using GROUP BY around your query:

SELECT T1.FirstKey, T1.SecondKey, T1.ThirdKey, T2.FirstKey, T2.SecondKey, T2.ThirdKey, T1.AdditionalColumns, T2.AdditionalColumns, COUNT(*)
FROM (
SELECT 
       T1.FirstKey, T1.SecondKey, T1.ThirdKey,
       T2.FirstKey, T2.SecondKey, T2.ThirdKey,
       T1.AdditionalColumns, T2.AdditionalColumns
FROM 
       T1 JOIN T2 ON T1.FirstKey = T2.FirstKey 
          AND T1.SecondKey = T2.SecondKey
          AND T1.SecondKey IS NOT NULL
UNION
SELECT 
       T1.FirstKey, T1.SecondKey, T1.ThirdKey,
       T2.FirstKey, T2.SecondKey, T2.ThirdKey, 
       T1.AdditionalColumns, T2.AdditionalColumns
FROM 
       T1 JOIN T2 ON T1.FirstKey = T2.FirstKey 
          AND T1.ThirdKey = T2.ThirdKey
          AND T1.SecondKey IS NULL
)
GROUP BY T1.FirstKey, T1.SecondKey, T1.ThirdKey, T2.FirstKey, T2.SecondKey, T2.ThirdKey, T1.AdditionalColumns, T2.AdditionalColumns
HAVING COUNT(*) = 1;
gregjor
A: 

Hi,

A suggestion.

Make your whole select a subquery. Let's name it SUBQ

Then you do it like this:

SELECT *
FROM (SUBQ)
GROUP BY `ThirdKey`
HAVING COUNT(*) = 1
Alin Purcaru