views:

9063

answers:

21

I'm trying to run the following SQL statement in Oracle, and it takes ages to run:

SELECT orderID FROM tasks WHERE orderID NOT IN 
(SELECT DISTINCT orderID FROM tasks WHERE
 engineer1 IS NOT NULL AND engineer2 IS NOT NULL)

If I run just the sub-part that is in the IN clause, that runs very quickly in Oracle, i.e.

SELECT DISTINCT orderID FROM tasks WHERE
engineer1 IS NOT NULL AND engineer2 IS NOT NULL

Why does the whole statement take such a long time in Oracle? In SQL Server the whole statement runs quickly.

Alternatively is there a simpler/different/better SQL statement I should use?

Some more details about the problem:

  • Each order is made of many tasks
  • Each order will be allocated (one or more of its task will have engineer1 and engineer2 set) or the order can be unallocated (all its task have null values for the engineer fields)
  • I am trying to find all the orderIDs that are unallocated.

Just in case it makes any difference, there are ~120k rows in the table, and 3 tasks per order, so ~40k different orders.

Responses to answers:

  • I would prefer a SQL statement that works in both SQL Server and Oracle.
  • The tasks only has an index on the orderID and taskID.
  • I tried the NOT EXISTS version of the statement but it ran for over 3 minutes before I cancelled it. Perhaps need a JOIN version of the statement?
  • There is an "orders" table as well with the orderID column. But I was trying to simplify the question by not including it in the original SQL statement.

I guess that in the original SQL statement the sub-query is run every time for each row in the first part of the SQL statement - even though it is static and should only need to be run once?

Executing

ANALYZE TABLE tasks COMPUTE STATISTICS;

made my original SQL statement execute much faster.

Although I'm still curious why I have to do this, and if/when I would need to run it again?

The statistics give Oracle's cost-based optimzer information that it needs to determine the efficiency of different execution plans: for example, the number of rowsin a table, the average width of rows, highest and lowest values per column, number of distinct values per column, clustering factor of indexes etc.

In a small database you can just setup a job to gather statistics every night and leave it alone. In fact, this is the default under 10g. For larger implementations you usually have to weigh the stability of the execution plans against the way that the data changes, which is a tricky balance.

Oracle also has a feature called "dynamic sampling" that is used to sample tables to determine relevant statistics at execution time. It's much more often used with data warehouses where the overhead of the sampling it outweighed by the potential performance increase for a long-running query.

+3  A: 

The "IN" - clause is known in Oracle to be pretty slow. In fact, the internal query optimizer in Oracle cannot handle statements with "IN" pretty good. try using "EXISTS":

SELECT orderID FROM tasks WHERE orderID NOT EXISTS 
    (SELECT DISTINCT orderID FROM tasks WHERE
         engineer1 IS NOT NULL AND engineer2 IS NOT NULL)`print("code sample");`

Caution: Please check if the query builds the same data results.

Edith says: ooops, the query is not well formed, but the general idea is correct. Oracle has to fulfill a full table scan for the second (inner) query, build the results and then compare them to the first (outer) query, that's why it's slowing down. Try

SELECT orderID AS oid FROM tasks WHERE NOT EXISTS 
    (SELECT DISTINCT orderID AS oid2 FROM tasks WHERE
         engineer1 IS NOT NULL AND engineer2 IS NOT NULL and oid=oid2)

or something similiar ;-)

Georgi
I came to the same query (see my answer below), except for : * the subquery has no reason to select DISTINCT orderIDs. * drop the "orderID" between WHERE and NOT EXISTS (syntax error). * drop the 'print("code sample"), obviously ;-)
Mac
When I try the second query it gets an error?ORA-00904: "OID2": invalid identifier
RickL
Use the "AS oid2" I inserted, it was not in the query before.
Georgi
@Mac: Oooops, 'print("code sample")', hmmm... my only excuse is that I'm working on a new laptop and didn't see it :) - thanks for your hints!
Georgi
Hmm, no, it still gives the same error... tried in SQL Plus and Toad Freeware.
RickL
Uh, I don't have a Oracle on my laptop at the moment, but give this a try: Completely skip the "AS oid2" and end the query with "oid=orderID)" - if you cant fix it, eliminate it ;-)
Georgi
"The "IN" - clause is known in Oracle to be pretty slow. In fact, the internal query optimizer in Oracle cannot handle statements with "IN" pretty good. try using "EXISTS":"Nonense. Where's the justification for this?
David Aldridge
AJ gave the correct syntax for the NOT EXISTS version of the SQL statement, but it is still too slow (over 3 minutes).
RickL
Georgi, where is your proof that IN is "slow" and the CBO can't handle it? If you're going to answer questions here, try to do so from a position of understanding, instead of one from hearsay and supposition
A: 

Isn't your query the same as

SELECT orderID FROM tasks
WHERE engineer1 IS NOT NULL OR engineer2 IS NOT NULL

?

ΤΖΩΤΖΙΟΥ
No it's not. I made the same mistake :-)Each order has mutliple tasks, and if one of those tasks has an engineer assigned, the order counts as "allocated"
RB
A: 

How about :

SELECT DISTINCT orderID FROM tasks t1 WHERE NOT EXISTS (SELECT * FROM tasks t2 WHERE t2.orderID=t1.orderID AND (engineer1 IS NOT NULL OR engineer2 IS NOT NULL));

I am not a guru of optimization, but maybe you also overlooked some indexes in your Oracle database.

Mac
I tried this but it was still running over a minute later when I cancelled it.
RickL
A: 

Another option is to use MINUS (EXCEPT on MSSQL)

SELECT orderID FROM tasks
MINUS
SELECT DISTINCT orderID FROM tasks WHERE engineer1 IS NOT NULL 
AND engineer2 IS NOT NULL
Vinko Vrsalovic
I thought of that, too, but I think this query is not the fastest, too.
Georgi
Yes, be careful, the minus statement uses alot of memory
Mike McAllister
Normally the implicit distinct would be unecessary overhead, but in this case it would probably be appropriate. Two scans of the table might be inefficient though.
David Aldridge
A: 

Sub-queries are "bad" with Oracle. It's generally better do use joins.

Here's an article on how to rewrite your subqueries with join : http://www.dba-oracle.com/sql/t_rewrite_subqueries_performance.htm

Michel
Nothing is necessary "bad". The difference between the benefits of different techniques almost always comes down to the value distributions of a particular data set and the presence or absence of indexes and constraints.
David Aldridge
A: 

Here is an alternate approach which I think gives what you want:

SELECT orderID
 FROM tasks
 GROUP BY orderID
 HAVING COUNT(engineer1) = 0 OR COUNT(engineer2) = 0

I'm not sure if you want "AND" or "OR" in the HAVING clause. It sounds like according to business logic these two fields should either both be populated or both be NULL; if this is guaranteed then you could reduce the condition to just checking engineer1.

Your original query would, I think, give multiple rows per orderID, whereas mine will only give one. I am guessing this is OK since you are only fetching the orderID.

Dave Costa
+2  A: 

Some questions:

  • How many rows are there in tasks?
  • What indexes are defined on it?
  • Has the table been analyzed recently?

Another way to write the same query would be:

select orderid from tasks
minus
select orderid from tasks
where engineer1 IS NOT NULL AND engineer2 IS NOT NULL

However, I would rather expect the query to involve an "orders" table:

select orderid from ORDERS
minus
select orderid from tasks
where engineer1 IS NOT NULL AND engineer2 IS NOT NULL

or

select orderid from ORDERS
where orderid not in
( select orderid from tasks
  where engineer1 IS NOT NULL AND engineer2 IS NOT NULL
)

or

select orderid from ORDERS
where not exists
( select null from tasks
  where tasks.orderid = orders.orderid
  and   engineer1 IS NOT NULL OR engineer2 IS NOT NULL
)
Tony Andrews
Your second proposal is NOT the same thing, since there can be multiple tasks per orderID, some assigned and some not. He only wants orderIDs for which NONE of the tasks are assigned.
Dave Costa
Your last proposal would need to have a correlated subquery (e.g. add "AND tasks.orderID = orders.orderID"). You can't simply change NOT IN to NOT EXISTS without modifying the subquery.
Dave Costa
Both points corrected now
Tony Andrews
A: 

Your query:

SELECT orderID
FROM tasks 
WHERE orderID NOT IN (
  SELECT DISTINCT orderID
  FROM tasks
  WHERE engineer1 IS NOT NULL AND engineer2 IS NOT NULL
)
;

Is equivalent to:

SELECT DISTINCT orderID
FROM tasks 
WHERE NOT ( engineer1 IS NOT NULL AND engineer2 IS NOT NULL )
;

Which can be further simplified to:

SELECT DISTINCT orderID
FROM tasks 
WHERE engineer1 IS NULL OR engineer2 IS NULL
;

Maybe Oracle does not realize how to optimize this but SQL Server does?

Kip
Not when there are multiple orderid's per task and they can have different statuses of engineer1 and engineer2 being null.
David Aldridge
Oh you're right. I'm so used to the primary key being the only field with "ID" in its name that I assumed orderID was the primary key of this table.
Kip
+2  A: 

I agree with TZQTZIO, I don't get your query.

If we assume the query did make sense then you might want to try using EXISTS as some suggest and avoid IN. IN is not always bad and there are likely cases which one could show it actually performs better than EXISTS.

The question title is not very helpful. I could set this query up in one Oracle database and make it run slow and make it run fast in another. There are many factors that determine how the database resolves the query, object statistics, SYS schema statistics, and parameters, as well as server performance. Sqlserver vs. Oracle isn't the problem here.

For those interested in query tuning and performance and want to learn more some of the google terms to search are "oak table oracle" and "oracle jonathan lewis".

Ethan Post
+2  A: 

I would try using joins instead

SELECT 
    t.orderID 
FROM 
    tasks  t
    LEFT JOIN tasks t1
     ON t.orderID = t1.orderID
     AND t1.engineer1 IS NOT NULL 
     AND t1.engineer2 IS NOT NULL
WHERE
    t1.orderID IS NULL

also your original query would probably be easier to understand if it was specified as:

SELECT orderID FROM orders WHERE orderID NOT IN 
(SELECT DISTINCT orderID FROM tasks WHERE
 engineer1 IS NOT NULL AND engineer2 IS NOT NULL)

(assuming you have orders table with all the orders listed)

which can be then rewritten using joins as:

SELECT 
    o.orderID 
FROM 
    orders o
    LEFT JOIN tasks t
     ON o.orderID = t.orderID
     AND t.engineer1 IS NOT NULL 
     AND t.engineer2 IS NOT NULL
WHERE
    t.orderID IS NULL
kristof
This doesn't work because it should only return orders where *all* tasks engineer fields are not null, but this will return orders where some tasks are engineer noll, and some tasks are engineer not null. Did you mean for the WHERE clause to be different?
RickL
It think it returns the same result as the query that you specified: SELECT orderID FROM tasks WHERE orderID NOT IN (SELECT DISTINCT orderID FROM tasks WHERE engineer1 IS NOT NULL AND engineer2 IS NOT NULL)which I believe returns:All orders that do not have any tasks with both engineers assigned
kristof
But perhaps I am missing something, if one assigned engineer is enough not to list the order then change the left join to read as: LEFT JOIN tasks t1 ON t.orderID = t1.orderID AND (t1.engineer1 IS NOT NULL OR t1.engineer2 IS NOT NULL) But that would be different from your original query.
kristof
A: 

If you have no index over the Engineer1 and Engineer2 columns then you are always going to generate a Table Scan in SQL Server and the equivalent whatever that may be in Oracle.

If you just need the Orders that have unallocated tasks then the following should work just fine on both platforms, but you should also consider adding the indexes to the Tasks table to improve query perfomance.

SELECT DISTINCT orderID 
FROM tasks 
WHERE (engineer1 IS NULL OR engineer2 IS NULL)
Jim Birchall
Typically, the engineer fields will be foreign keys, so the indexes should be there.
ΤΖΩΤΖΙΟΥ
@ΤΖΩΤΖΙΟΥ Thanks for the downvote, but did you read the question? "The tasks only has an index on the orderID and taskID"
Jim Birchall
A: 

If you decide to create an ORDERS table, I'd add an ALLOCATED flag to it, and create a bitmap index. This approach also forces you to modify the business logic to keep the flag updated, but the queries will be lightning fast. It depends on how critical are the queries for the application.

Regarding the answers, the simpler the better in this case. Forget subqueries, joins, distinct and group bys, they are not needed at all!

+1  A: 

I think several people have pretty much the right SQL, but are missing a join between the inner and outer queries.
Try this:

SELECT t1.orderID 
FROM   tasks t1
WHERE  NOT EXISTS
       (SELECT 1 
        FROM   tasks t2 
        WHERE  t2.orderID   = t1.orderID
        AND    t2.engineer1 IS NOT NULL 
        AND    t2.engineer2 IS NOT NULL)
AJ
Thanks, I tried this and it is the correct syntax, but it was still running for over 3 minutes when I cancelled it.
RickL
+6  A: 

Often this type of problem goes away if you analyze the tables involved (so Oracle has a better idea of the distribution of the data)

ANALYZE TABLE tasks COMPUTE STATISTICS;
hamishmcn
Incredible, after I ran this it took just under 1 second to run my original SQL statement.
RickL
This is an obsolete syntax for collecting statistics. DBMS_STATS is a more robust way. http://download.oracle.com/docs/cd/B19306_01/server.102/b14211/stats.htm#PFGRF30102
David Aldridge
I agree, this is obsolete. Please try to avoid using it. Use DBMS_STATS and make sure you also get your indexes, you can set cascade=>true when you gather stats for the table.
Ethan Post
A: 

What proportion of the rows in the table meet the condition "engineer1 IS NOT NULL AND engineer2 IS NOT NULL"?

This tells you (roughly) whether it might be worth trying to use an index to retrieve the associated orderid's.

Another way to write the query in Oracle that would handle unindexed cases very well would be:

select distinct orderid
from
(
select orderid,
       max(case when engineer1 is null and engineer2 is null then 0 else 1)
          over (partition by orderid)
          as max_null_finder
from   tasks
)
where max_null_finder = 0
David Aldridge
A: 

The Oracle optimizer does a good job of processing MINUS statements. If you re-write your query using MINUS, it is likely to run quite quickly:

SELECT orderID FROM tasks
MINUS
SELECT DISTINCT orderID FROM tasks WHERE
 engineer1 IS NOT NULL AND engineer2 IS NOT NULL
JoshL
A: 

New take.

Iff:

  • The COUNT() function does not count NULL values

and

  • You want the orderID of all tasks where none of the tasks have either engineer1 or engineer2 set to a value

then this should do what you want:

SELECT orderID
FROM tasks
GROUP BY orderID
HAVING COUNT(engineer1) = 0 AND COUNT(engineer2) = 0

Please test it.

ΤΖΩΤΖΙΟΥ
+1  A: 

"Although I'm still curious why I have to do this, and if/when I would need to run it again?"

The statistics give Oracle's cost-based optimzer information that it needs to determine the efficiency of different execution plans: for example, the number of rowsin a table, the average width of rows, highest and lowest values per column, number of distinct values per column, clustering factor of indexes etc.

In a small database you can just setup a job to gather statistics every night and leave it alone. In fact, this is the default under 10g. For larger implementations you usually have to weigh the stability of the execution plans against the way that the data changes, which is a tricky balance.

Oracle also has a feature called "dynamic sampling" that is used to sample tables to determine relevant statistics at execution time. It's much more often used with data warehouses where the overhead of the sampling it outweighed by the potential performance increase for a long-running query.

David Aldridge
Thanks, copied your answer into the question.
RickL
Dave's on a Dynamic Sampling Binge
I loves the dynamic sampling. I don't know it isn't spoken of more often.
David Aldridge
@Dave, only if your queries are big and infrequent. Otherwise you spend more time sampling that doing.
WW
Well it all bears measuring of course. DS impact depends on a lot of issues, and you can get information from a DS query that lasts a tiny fraction of a second that can take a query from 10 seconds down to a fraction of a second. For example dynamic sampling of a part of a query in which multiple predicates run against a small dimension table (public holidays on a Monday in a particular year? Women over 5'11''?) need not take long but can be very valuable. Long story short, it's just another tool to understand and appreciate for what it is.
David Aldridge
A: 

For your followup question, you may be interested in this:

http://stackoverflow.com/questions/79229/how-often-should-oracle-database-statistics-be-run

Dave Costa
A: 

Could you provide the explain plans for both queries, so we have some idea what the optimizer thinks about them both?

EvilTeach
this isn't an answer, it's a comment. Put these in the comments to the question section.
A: 

I agree with ΤΖΩΤΖΙΟΥ and wearejimbo that your query should be...

SELECT DISTINCT orderID FROM Tasks 
WHERE Engineer1 IS NULL OR Engineer2 IS NULL;

I don't know about SQL Server, but this query won't be able to take advantage of any indexes because null rows aren't in indexes. The solution to this would be to re-write the query in a way that would allow a function based index to be created that only includes the null value rows. This could be done with NVL2, but would likely not be portable to SQL Server.

I think the best answer is not one that meets your criteria and that is write a different statement for each platform that is best for that platform.

Leigh Riffel
SQL Server allows only one NULL row in an index except you create a filtered index. In this case it does not include NULLs.
usr
Thanks for the info.
Leigh Riffel