views:

283

answers:

5

Hello, I'm having a problem getting a query to work, which I think should work. It's in the form

SELECT DISTINCT a, b, c FROM t1 WHERE NOT IN ( SELECT DISTINCT a,b,c FROM t2 ) AS alias

But mysql chokes where "IN (" starts. Does mysql support this syntax? If not, how can I go about getting these results? I want to find distinct tuples of (a,b,c) in table 1 that don't exist in table 2.

A: 

As far as I know, NOT IN can only be used for 1 field at a time. And the field has to be specified in between "WHERE" and "NOT IN".

(Edit:) Try using a NOT EXISTS:

SELECT a, b, c 
FROM t1 
WHERE NOT EXISTS 
   (SELECT * 
   FROM t2 
   WHERE t1.a = t2.a AND t1.b = t2.b AND t1.c = t2.c)

In addition, an inner join on a, b, and c being equal should give you all non-DISTINCT tuples, while a LEFT JOIN with a WHERE IS NULL clause should give you the DISTINCT ones, as Charles mentioned below.

froadie
Holy God, that would make a nightmare query. I only care about the values of a, b, or c depending on the values of the other two!
How about using a join? Inner joining on all 3 fields should return all non-distinct tuples
froadie
Or how about using exists? SELECT a, b, c FROM t1 WHERE NOT EXISTS (SELECT * FROM t2 WHERE t1.a = t2.a AND t1.b = t2.b AND t1.c = t2.c). Not sure if this is exactly correct, I haven't had much experience with exists
froadie
A: 

SELECT DISTINCT t1.* FROM t1 LEFT JOIN t2 ON (t1.a = t2.a AND t1.b = t2.b AND t1.c = t2.c) WHERE t2.a IS NULL

Charles
This is a very bad idea, if the tables are big that left join will be painfully slow. You're doing alot of work, reading to both tables fully creating a left join result set, etc., which is simply unnecessary.
wich
Indeed. Not performance wise.
Charles
+2  A: 

You should use not exists:

SELECT DISTINCT a, b, c FROM t1 WHERE NOT EXISTS (SELECT NULL FROM t2 WHERE t1.a = t2.a AND t1.b = t2.b AND t1.c = t2.c)

Using NOT IN is not the best method to do this, even if you check only one key. The reason is that if you use NOT EXISTS the DBMS will only have to check indices if indices exist for the needed columns, where as for NOT IN it will have to read the actual data and create a full result set that subsequently needs to be checked.

Using a LEFT JOIN and then checking for NULL is also a bad idea, it will be painfully slow when the tables are big since the query needs to make the whole join, reading both tables fully and subsequently throw away a lot of it. Also, if the columns allow for NULL values checking for NULL will report false positives.

wich
You know what? I ran this query, and it was taking a long time ( longer than 10 minutes ), so I killed it. Then I set up temp tables for the two tables and inserted distinct information. Then I ran the query against the temp tables. Took one minute and 4 seconds. Why couldn't mysql optimize this query thusly?
To be honest, mysql is rather stupid and slow in some regards. Oracle, MS SQL and PostgreSQL do a lot better in many regards. Of course it will also be a lot faster if you add indices to the columns of the tables, if you don't have them already. although that will go at the cost of insert time since the indices need to be updated each time, it's a matter of how many times you do each operation and which is the more time critical.
wich
A: 

Need to add a column list after the WHERE clause and REMOVE the alias.

I tested this with a similar table and it is working.

SELECT DISTINCT a, b, c 
FROM t1 WHERE (a,b,c)
NOT IN (SELECT DISTINCT a,b,c FROM t2)

Using the mysql world db:

-- dont include city 1, 2
SELECT DISTINCT id, name FROM city 
WHERE (id, name) 
NOT IN (SELECT id, name FROM city  WHERE ID IN (1,2))
Yada
It would be better to use NOT EXISTS, NOT IN will force a result set to be made for the sub query by reading the whole table, when you use NOT EXISTS no result set needs to be created for the sub query and if the columns are indexed, the NOT EXISTS will only read the indices.
wich
Are you sure about that? If the columns are not indexed NOT EXISTS is as slow NOT IN.
Yada
A: 

Well, I'm going to answer my own question, in spite of all the great advice others gave.

Here's the proper syntax for what I was trying to do.

SELECT DISTINCT a, b, c FROM t1 WHERE (a,b,c) NOT IN ( SELECT DISTINCT a,b,c FROM t2 )

Can't vouch for the efficiency of it, but the broader questions I was implicitly putting was "How do I express this thought in SQL", not "How do I get a particular result set". I know that's unfair to everyone who took a stab, sorry!

Depending on the definition of columns a, b and c of t2 this could be wrong! If they allow for NULL values then the result for the NOT IN will always be unknown for any such value. NOT EXISTS *is* the right way to express this, NOT EXISTS was created for this.
wich
Awesome! Thanks.