tags:

views:

46

answers:

2

Sounds simple but I'm stuck

Table A                 Table B

col_a  col_b          col_a  col_c
1      b              1      c
2      c              2      d
3      z              3      a
4      d              4      e
33     a              5      k
                      6      l
                      33     b
                      33     b

I want to JOIN table A with B:

select * from A inner join B on A.col_a = B.col_a

I am expecting to get 5 records as a result.

Expected join result         ** Actual result **

col_a  col_b col_c col_x[n]...       col_a col_b col_c  col_y[n]...
1      b     c     ...                   1     b     c  ...
2      c     d     ...                   2     c     d  ...
3      z     a     ...                   3     z     a  ...
4      d     e     ...                   4     d     e  ...
33     a     b     ...                   33    a     b  ...
                                         33    a     b  ...

Why did MySQL match 33 twice? Because they are 2 values with 33 in table B.

What I want though, is just one record with the same value in col_a. How do I do that?

EDIT: I am updating the tables' design to include more columns that contain non-identical data, because them being as they were posed more questions than solved problems. Anyway, the answer to this is to use GROUP BY, but the performance penalty is huge, especially on a table that contains above 50 million records (and growing). However, the best approach to solve my problem was to use a compound statement (using UNION ALL) for every distinct value in col_a. The performance benefit was x5 ~ x10 faster !!

+4  A: 

You have 33 twice in Table B.

Either SELECT DISTINCT or GROUP BY col_a, ...:

SELECT DISTINCT * 
FROM    A 
JOIN    B ON ( A.col_a = B.col_a )
;

or

SELECT    * 
FROM      A 
JOIN      B ON ( A.col_a = B.col_a )
GROUP BY  col_a, col_b, col_c
;

You should clean up that table, though. Depending on how many occurrences of a repeated row, it might be faster to use a subquery:

SELECT  * 
FROM    A 
JOIN    (select distinct * from B) AS C
        ON ( A.col_a = C.col_a )
;
vol7ron
notice the lack of `INNER` as `JOIN` by itself is an `INNER JOIN` by default.
vol7ron
the subquery is not advised and depends on the database's query plan. but if you have **many** duplicate entries from both tables (for whatever awful reason), then the subquery would reduce the work involved.
vol7ron
GROUP BY seems to do the trick (although there is some performance penalty)!
Theo Zographos
`DISTINCT` and `GROUP BY` are going to come at a performance penlaty vs not having it, due to the extra check/logic involved in the distinction. That's one reason why it's important to have unique records only - something that can be solved by adding a `unique constraint` or distinguishing a `primary key`. The difference between the two is that `DISTINCT` is a `GROUP BY` with an `ORDER BY`, so there is an added penalty in sorting.
vol7ron
+2  A: 

The quick & dirty answer is:

select DISTINCT * from A inner join B on A.col_a = B.col_a

But the real question is, why do you have two identical entries in Table B?

Usually when you have to use DISTINCT, it indicates a problem in your data model.

Drew Hall
Not all records are identical column to column, because there's a auto increment field in tables A and B. It's basically a logging application and the actual problem is more complex, but "duplicate" records are unavoidable :( Anyway, using DISTINCT * doesn't help
Theo Zographos
@Theo Zographo: "Auto increment artificial primary key causes real life duplicates" is not a new story :(
onedaywhen