I want to efficiently check if a table contains any rows that match <condition A> and do not match <condition B>, where the conditions are arbitrary.

In Oracle, this almost works:

select count(*) from dual
where exists (
  select * from people
  where (<condition A>)
  and not (<condition B>)
);
-- returns zero if all rows that match <condition A> also match <condition B>
-- (well, almost)

The problem is the dreaded null values. Let's say <condition A> is name = 'Aaron' and <condition B> is age = 21. The query correctly identifies any Aarons whose age is not equal to 21, but it fails to identify any Aarons whose age is null.
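To illustrate with the name/age example (the sample data here is hypothetical): under SQL's three-valued logic, age = 21 evaluates to UNKNOWN when age is null, and NOT (UNKNOWN) is still UNKNOWN, so the row is filtered out either way:

```sql
-- Suppose people contains: ('Aaron', 21), ('Aaron', 30), ('Aaron', NULL)
select count(*) from people
where name = 'Aaron'
  and not (age = 21);
-- Counts only the 30-year-old Aaron. The NULL-age Aaron is silently dropped,
-- because NOT (NULL = 21) evaluates to UNKNOWN, not TRUE.
```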

Here is a correct solution, but on a table with millions of records it can take a while:

select (
  select count(*) from people
  where (<condition A>)
) - (
  select count(*) from people
  where (<condition A>)
  and (<condition B>)
) from dual;
-- returns zero if all rows that match <condition A> also match <condition B>
-- (correct, but it is s l o w...)

Unfortunately, the two conditions will be arbitrary, complex, changing, and generally out of my control. They are generated by the application's persistence framework from user searches, and while we try to keep our indexes in step with our users, a lot of the time the conditions will cause big table scans. (That is why the first query with the "exists" clause is so much faster than the second: it can stop as soon as it finds one matching record, and it doesn't have to do two separate scans.)

How can I do this efficiently without futzing up on the nulls?

A: 

I don't think there's anything you can do if the conditions are completely arbitrary. Is it possible to "re-write" the conditions at some point based on some rules?

I believe that if you do this:

... where not (age = 21) ....

which translates internally into:

... where (age != 21) ...

you get too few records because it doesn't match null values, right?

But if you do this:

... where not (age = 21 and age is not null) ....

which translates internally into:

... where (age != 21 or age is null) ....

then you will get the expected results. (right?)

So can you force all the comparisons in your conditions to include a null test, either in the form (... or x is null) or (... and x is not null)?
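Concretely, with the question's example conditions, the exists query from the question would become something like this sketch, with the null test attached to the negated condition:

```sql
select count(*) from dual
where exists (
  select * from people
  where (name = 'Aaron')                    -- <condition A>
  and not (age = 21 and age is not null)    -- null-safe not(<condition B>)
);
-- An Aaron with a null age now correctly counts as "not matching <condition B>".
```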

Tim Sylvester
What you said is right. That is an interesting approach, but a bit more complicated than I hoped. The conditions can be complex, like this: ((woclass = 'ACTIVITY' or woclass = 'WORKORDER') and reportdate <= TO_TIMESTAMP ('2001-07-01 23:59:59.000' , 'YYYY-MM-DD HH24:MI:SS.FF') and upper(reportedby) like '%STEVE%' and historyflag = 0 and istask = 0 and (exists (select siteid from locancestor where ((ancestor like '%SEG100%')) and (location =workorder.location and systemid = ( select systemid from locsystem where primarysystem = '1' and siteid =workorder.siteid) and siteid=workorder.siteid)))
Joe Daley
Yeah, post-processing that is a daunting task. Any chance of modifying the code that builds the expressions?
Tim Sylvester
+1  A: 

Assuming your table has a primary key id, one possible approach is:

select count(*)
from people p1
left join people p2
  on (p1.id = p2.id
  and (p2.<condition A>)
  and (p2.<condition B>))
where p1.<condition A>
  and p2.id IS NULL

You do need some simple preprocessing on the conditions (prefacing each column name with p1. or p2. as appropriate), but that's much easier than correctly negating conditions with the NULL issues you mention.

LEFT JOIN sometable ON whatever WHERE ... AND sometable.id IS NULL is a popular way to express "and there's no corresponding record in sometable that satisfies the whatever constraint", so I would expect a good engine to be well tuned to optimize that idiom as much as the available indexes allow.
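For instance, with the question's example conditions, the prefixing preprocessing would produce something like this (a sketch, assuming the hypothetical id primary key):

```sql
select count(*)
from people p1
left join people p2
  on p1.id = p2.id
  and (p2.name = 'Aaron')   -- <condition A>, columns prefixed with p2.
  and (p2.age = 21)         -- <condition B>, columns prefixed with p2.
where (p1.name = 'Aaron')   -- <condition A>, columns prefixed with p1.
  and p2.id is null;
-- A null age makes the join condition UNKNOWN, so no p2 row is joined
-- and the NULL-age Aaron is correctly counted as not matching <condition B>.
```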

Alex Martelli
A: 

If you do have the id field, try:

select count(*) from dual
where exists (
  select * from people p
  where (<condition A>)
  and p.id not in (
    select id from people
    where (<condition B>)
  )
);

Thanks. I think this is the fastest solution that avoids preprocessing the conditions. In my testing, the speed is comparable to Alex's solution.
Joe Daley
A: 

One solution is to get rid of any nulls within the parameters of the comparison first, that is, append strings to values or replace nulls with impossible values for that column. For an example of the first:

select x, y
  from foo
  join bar on bar.a||'x' = foo.a||'x' /* Replace "=" with "<>" for opposite result */
;

Replacing nulls:

select x, y
  from foo
  join bar on nvl(bar.a, 'x') = nvl(foo.a, 'x') -- Ditto
;

Now, the second option is harder (at least in Oracle 9.2) because you have to make sure that the replacement value is the same datatype as the column it replaces (NVL is a bit silly like that), and that it's a value outside the precision of the column datatype (e.g., 9999 for number(3)), but it might be possible to make it work with indexes. Of course, this is not possible if the column already uses maximum precision / length.
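For example, with a hypothetical number(3) column n, the replacement value has to be numeric, and 9999 can never collide with a stored value because it exceeds the column's precision:

```sql
select x, y
  from foo
  join bar on nvl(bar.n, 9999) = nvl(foo.n, 9999) -- 9999 is impossible in number(3)
;
```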

l0b0
A: 

If for every nullable column you can come up with a dummy value that should never be valid, then you could do something like this:

select count(*) from dual
where exists (
  select * from (
    select NVL(name, 'No Name') name, NVL(age, -1) age from people
    )
  where (<condition A>)
  and not (<condition B>)
);

You would probably want to create function-based indexes on those expressions.
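The function-based indexes for the expressions above might look like this (a sketch; the index names are made up):

```sql
create index people_name_nvl_idx on people (nvl(name, 'No Name'));
create index people_age_nvl_idx  on people (nvl(age, -1));
```

A side benefit: because NVL makes the indexed expression non-null, rows with null name or age get index entries, which a plain single-column b-tree index would not give you.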

This is certainly easier than parsing through the conditions at runtime and trying to replace the column names with NVL expressions, and it should have the same end result.

Dave Costa