views:

606

answers:

4

I have a 1:1 relationship between two tables. I want to find all the rows in table A that don't have a corresponding row in table B. I use this query:

SELECT id 
  FROM tableA 
 WHERE id NOT IN (SELECT id 
                    FROM tableB) 
ORDER BY id desc

id is the primary key in both tables. Apart from primary key indices, I also have a index on tableA(id desc).

Using H2 (Java embedded database), this results in a full table scan of tableB. I want to avoid a full table scan.

How can I rewrite this query to run quickly? What index should I should?

+9  A: 
select tableA.id from tableA left outer join tableB on (tableA.id = tableB.id)
where tableB.id is null
order by tableA.id desc

If your db knows how to do index intersections, this will only touch the primary key index

SquareCog
This is why I love Stack Overflow. Saturday, SQL problem - question answered precisely and successfully in 5 minutes!
Steve McLeod
you got some good suggestions in the other answers as well. Naturally I think mine will be fastest :-) but db implementations vary widely, and I've no experience with H2. It would be great if you benchmarked the different approaches and updated the question with your results.
SquareCog
+6  A: 
Eric
+1: The EXISTS condition is considered "to be met" if the subquery (correllated in this case) returns at least one row.
OMG Ponies
benchmarking is a good idea. I am wracking my brain trying to figure out what a db could be doing under the covers for exists+correlated subquery that would make it faster than an index-only hash join. Do you know?
SquareCog
`Exists` isn't using your standard correlated subquery. It uses a semi-join. The execution plan on SQL Server 2008 for `left join` is two index scans to a hash match to a filter to a select. For the `not exists`, it is two index scans to a hash match to a select--no filter. The `exists` hash match is actually slightly faster than the `left join`. `left join` has a total cost of 1.09, `not exists` of 1.07 on `DimCustomer` for `AdventureWorksDW` to `AdventureWorksDW2008`.
Eric
Nice!! Thanks. That's one smart optimizer. Granted the cost is approximate, but I buy it on the filter vs semijoin principle.
SquareCog
A: 

I can't tell you which of these methods will be best on H2 (or even if all of them will work), but I did write an article detailing all of the (good) methods available in TSQL. You can give them a shot and see if any of them works for you:

http://code.msdn.microsoft.com/SQLExamples/Wiki/View.aspx?title=QueryBasedUponAbsenceOfData&referringTitle=Home

Aaron Alton
+2  A: 

You have to check every ID in tableA against every ID in tableB. A fully featured RDBMS (such as Oracle) would be able to optimize that into an INDEX FULL FAST SCAN and not touch the table at all. I don't know whether H2's optimizer is as smart as that.

H2 does support the MINUS syntax so you shou;d try this

select id from tableA
minus
select id from tableB
order by id desc

That may perform faster; it is certainly worth benchmarking.

APC