views:

508

answers:

3

I've got a complex SQL where clause that just got more complex because of a requirements change. There are four basic sets of cases, each with a different combination of other factors. It's more readable (in my opinion) to have the four cases as separate branches of the where clause, and to repeat the redundant criteria in each branch. But I don't know how well the database engine will optimize that.

Here's the expression in its redundant form. I've replaced the actual criteria with letters. A is the "branching" criterion, which comes in four forms. All of the expressions are in the form field='value', unless otherwise noted.

A1 AND B AND C AND D
OR A2 AND B AND C AND D AND E AND F1 AND G
OR A3 AND A3a AND B AND C AND D AND E AND F1 AND G
OR A4 AND B AND C AND D AND F2

All of the A's except A4 are in the form field in ('value1','value2'). D is field > 'value'. G is in the form field not in (subquery).

Here's the expression, factored to (I think) its least redundant form.

B AND C AND D AND (
    A1
    OR (
        E AND F1 AND G AND (
            A2
            OR (A3 AND A3a)
        )
    )
    OR (A4 AND F2)

My question is whether I should factor this expression into its simplest (least redundant) logical form, or whether it's OK to keep it in it's more redundant but also more readable form. The target database is Sybase, but I'd like to know the answer for RDMBSs generally.

A: 

I would refactor it. Eventually, duplicated logic will run you into problems. The second example may take a couple seconds longer to understand, but in the big scope of things it's easier to see what is going on as you can quickly look at the entire where clause and begin to determine what affects what.

Kevin
+2  A: 

If I were attacking this problem on M$ SQL Server, I would write it the way I wanted and look at the query execution plan. If it (a) ran slow and (b) had a bad execution plan, then I would refactor and document. Not sure what the mechanism is in Sybase for viewing HOW the optimizer executes the query.

Bill
Thanks for the post. In Sybase you can run SET SHOWPLAN ON before you run your query and it will show you details (although in a text-based format and not a graphical representation like SQL Server's)
John M Gant
+2  A: 

In an RDBMS world I wouldn't bother of redundancy much, efficiency is more important here.

In your case, I would UNION all the four queries using A's as a top condition, like this:

SELECT  *
FROM    mytable
WHERE   A1 AND B AND C
UNION
SELECT  *
FROM    mytable
WHERE   A2 AND B AND C AND D AND E AND F1 AND G
…

I didn't look into Sybase for more than 7 years, but in all major RDBMS's UNION's are more efficient than OR's.

See this article in my blog for approach to a silimar problem in Oracle:

and also this article for comparison of UNION versus OR in MySQL:

I think these approaches will work well for Sybase too.

You also need to create indexes on columns used in your conditions to benefit from UNION's

Update:

Since condition G is a subquery, it may probably happen that it needs a HASH JOIN to perform fast. HASH JOIN requires a full scan on all unfiltered values, that's why it may probable be better to filter all values in a single fullscan and then perform a HASH JOIN:

SELECT  *
FROM    (
        SELECT  *
        FROM    foo
        WHERE   condition_set_1
        UNION
        SELECT  *
        FROM    foo
        WHERE   condition_set_2_but_no_g
        …
        ) q
WHERE   G

To make any further judgements, it will be much better to see the query itself, really.

Quassnoi
Good suggestion. Would the fact that criterion G uses a subquery make the UNION approach less efficient (since theoretically it would happen in two of the SELECTs), or do database engines optimize those types of things?
John M Gant
As for G, it's needed to see the query itself to make any judgements. If the subquery is complex, G could probably kill all benefits from the indexes used on separated UNION'ed queries. Maybe full table scan will be better.
Quassnoi
The subquery logic is simple, but the table is huge. The subquery is selecting on an indexed field, though, so it may not be too bad. I think this specific case is one where I'll have to try it a couple of different ways and see what works best.
John M Gant