I'm trying to run the following SQL statement in Oracle, and it takes ages to run:
SELECT orderID FROM tasks WHERE orderID NOT IN
(SELECT DISTINCT orderID FROM tasks WHERE
engineer1 IS NOT NULL AND engineer2 IS NOT NULL)
If I run just the sub-part that is in the IN clause, that runs very quickly in Oracle, i.e.
SELECT DISTINCT orderID FROM tasks WHERE
engineer1 IS NOT NULL AND engineer2 IS NOT NULL
Why does the whole statement take such a long time in Oracle? In SQL Server the whole statement runs quickly.
Alternatively is there a simpler/different/better SQL statement I should use?
Some more details about the problem:
- Each order is made of many tasks
- Each order will be allocated (one or more of its task will have engineer1 and engineer2 set) or the order can be unallocated (all its task have null values for the engineer fields)
- I am trying to find all the orderIDs that are unallocated.
Just in case it makes any difference, there are ~120k rows in the table, and 3 tasks per order, so ~40k different orders.
Responses to answers:
- I would prefer a SQL statement that works in both SQL Server and Oracle.
- The tasks only has an index on the orderID and taskID.
- I tried the NOT EXISTS version of the statement but it ran for over 3 minutes before I cancelled it. Perhaps need a JOIN version of the statement?
- There is an "orders" table as well with the orderID column. But I was trying to simplify the question by not including it in the original SQL statement.
I guess that in the original SQL statement the sub-query is run every time for each row in the first part of the SQL statement - even though it is static and should only need to be run once?
Executing
ANALYZE TABLE tasks COMPUTE STATISTICS;
made my original SQL statement execute much faster.
Although I'm still curious why I have to do this, and if/when I would need to run it again?
The statistics give Oracle's cost-based optimzer information that it needs to determine the efficiency of different execution plans: for example, the number of rowsin a table, the average width of rows, highest and lowest values per column, number of distinct values per column, clustering factor of indexes etc.
In a small database you can just setup a job to gather statistics every night and leave it alone. In fact, this is the default under 10g. For larger implementations you usually have to weigh the stability of the execution plans against the way that the data changes, which is a tricky balance.
Oracle also has a feature called "dynamic sampling" that is used to sample tables to determine relevant statistics at execution time. It's much more often used with data warehouses where the overhead of the sampling it outweighed by the potential performance increase for a long-running query.