Suppose you have a table RULES with 3 columns A, B, and C. As data enters the system, I want to know if any row of the RULES table matches my data with the condition that if the corresponding column in the RULES table is null, all data matches. The obvious SQL is:

SELECT * FROM RULES
WHERE (A = :a OR A IS NULL)
  AND (B = :b OR B IS NULL)
  AND (C = :c OR C IS NULL)

So if I have rules:

RULE    A        B        C
1       50       NULL     NULL
2       51       xyz      NULL
3       51       NULL     123
4       NULL     xyz      456

An input of (50, xyz, 456) will match rules 1 and 4.

Question: Is there a better way to do this? With only 3 fields this is no problem. But the actual table will have 15 columns and I worry about how well that SQL scales.

Speculation: An alternative SQL statement I came up with involves adding an extra column to the table holding a count of how many of that row's fields are not null. (So in the example, this column's value for rules 1-4 is 1, 2, 2 and 2 respectively.) With this "col_count" column, the select could be:

SELECT * FROM RULES
WHERE (CASE WHEN A = :a THEN 1 ELSE 0 END)
    + (CASE WHEN B = :b THEN 1 ELSE 0 END)
    + (CASE WHEN C = :c THEN 1 ELSE 0 END)
    = COL_COUNT

Unfortunately, I don't have enough sample data to find out which of these approaches would perform better. Before I start generating random rules, I thought I'd ask here whether there was a better approach.
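In case it helps to see it concretely, keeping COL_COUNT in step with the rules could be a single statement run whenever rules are added or edited (just a sketch; it assumes the column has already been added to RULES):

```sql
-- Sketch: recompute COL_COUNT after the rules change.
-- Assumes something like ALTER TABLE RULES ADD COL_COUNT INT has been done.
UPDATE RULES
   SET COL_COUNT = (CASE WHEN A IS NOT NULL THEN 1 ELSE 0 END)
                 + (CASE WHEN B IS NOT NULL THEN 1 ELSE 0 END)
                 + (CASE WHEN C IS NOT NULL THEN 1 ELSE 0 END);
```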

Note: Data mining techniques and column constraints are not feasible here. The data must be checked as it enters the system and so it can be flagged pass/fail immediately. And, the users control the addition or removal of rules so I can't convert the rules into column constraints or other data definition statements.

One last thing, in the end I need a list of all the rules that the data fails to pass. The solution cannot abort at the first failure.
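For that last requirement, negating the matching condition should list the failing rules directly (a sketch; it assumes the bind variables themselves are never null):

```sql
-- Sketch: every rule the input fails, by negating the match condition.
-- Assumes :a, :b and :c are never null themselves.
SELECT * FROM RULES
 WHERE NOT (    (A = :a OR A IS NULL)
            AND (B = :b OR B IS NULL)
            AND (C = :c OR C IS NULL) );
```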

Thanks.

A: 

Are there too many rows/rules? If not (that's subjective, but say fewer than 10,000), you could create indexes on all of the columns.

That would increase speed significantly, and the indexes won't take much space.

If you don't plan to make a huge table of rules, then I bet your approach is ok provided you index all columns.
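For example, one single-column index per rule column (the index names here are just placeholders):

```sql
-- Sketch: one index per column so each OR branch can use an index.
CREATE INDEX idx_rules_a ON RULES (A);
CREATE INDEX idx_rules_b ON RULES (B);
CREATE INDEX idx_rules_c ON RULES (C);
```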

Seb
In Oracle, entirely-null keys are not stored in B-tree indexes, so the IS NULL predicate cannot take advantage of an index.
FerranB
+1  A: 

The first query you provided is fine. I really doubt that adding the column you describe would give you any more speed: every comparison to NULL yields UNKNOWN, which a WHERE clause treats as false, so the nullability of each entry is effectively checked anyway. I would guess that x = y is expanded to x IS NOT NULL AND x = y internally. Maybe someone else can clarify that.

All other optimizations I can think of would involve precalculation or caching. You can create [temporary] tables matching certain rules or add further columns holding matching rules.

soulmerge
A: 

Why not make indices of your rules table by the values? Then you can

SELECT myvalue FROM RULES_A
Charlie Martin
So, you are saying select against 3 separate tables rather than a single table? (Or 15 tables as in the actual case)
jmucchiello
A: 

It sounds like what you really have are Rules and Rule Sets. Modeling it that way will not only make this particular coding much simpler, but will also make the model expandable when you decide that you need a 16th column.

For example:

CREATE TABLE Rules (
    rule_id   INT   NOT NULL,
    rule_category CHAR(1)  NOT NULL, -- This is like your column idea
    rule_int_value INT   NULL,
    rule_str_value VARCHAR(20) NULL,
    CONSTRAINT PK_Rules PRIMARY KEY CLUSTERED (rule_id),
    CONSTRAINT CK_Rules_one_value CHECK (rule_int_value IS NULL OR rule_str_value IS NULL)
)

CREATE TABLE Rule_Sets (
    rule_set_id INT NOT NULL,
    rule_id  INT NOT NULL,
    CONSTRAINT PK_Rule_Sets PRIMARY KEY CLUSTERED (rule_set_id, rule_id)
)

Some data that would match your given rules:

INSERT INTO Rules (rule_id, rule_category, rule_int_value, rule_str_value)
VALUES (1, 'A', 50, NULL)
INSERT INTO Rules (rule_id, rule_category, rule_int_value, rule_str_value)
VALUES (2, 'A', 51, NULL)
INSERT INTO Rules (rule_id, rule_category, rule_int_value, rule_str_value)
VALUES (3, 'B', NULL, 'xyz')
INSERT INTO Rules (rule_id, rule_category, rule_int_value, rule_str_value)
VALUES (4, 'C', 123, NULL)
INSERT INTO Rules (rule_id, rule_category, rule_int_value, rule_str_value)
VALUES (5, 'C', 456, NULL)

INSERT INTO Rule_Sets (rule_set_id, rule_id) VALUES (1, 1)
INSERT INTO Rule_Sets (rule_set_id, rule_id) VALUES (2, 2)
INSERT INTO Rule_Sets (rule_set_id, rule_id) VALUES (2, 3)
INSERT INTO Rule_Sets (rule_set_id, rule_id) VALUES (3, 2)
INSERT INTO Rule_Sets (rule_set_id, rule_id) VALUES (3, 4)
INSERT INTO Rule_Sets (rule_set_id, rule_id) VALUES (4, 3)
INSERT INTO Rule_Sets (rule_set_id, rule_id) VALUES (4, 5)

A test script that confirms the same answer that you expect:

DECLARE
    @a INT,
    @b VARCHAR(20),
    @c INT

SET @a = 50
SET @b = 'xyz'
SET @c = 456

SELECT DISTINCT
    rule_set_id AS failed_rule_set_id
FROM
    Rule_Sets RS
WHERE
    NOT EXISTS (SELECT * FROM Rules R WHERE R.rule_id = RS.rule_id AND @a = R.rule_int_value) AND
    NOT EXISTS (SELECT * FROM Rules R WHERE R.rule_id = RS.rule_id AND @b = R.rule_str_value) AND
    NOT EXISTS (SELECT * FROM Rules R WHERE R.rule_id = RS.rule_id AND @c = R.rule_int_value)

If you can present the input data in a set-based form rather than as individual parameters then the final SQL statement can be more dynamic and wouldn't have to grow as you add additional columns.
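For example, something along these lines (the staging table and its column names are hypothetical):

```sql
-- Hypothetical: stage the incoming values as rows instead of parameters,
-- so the query no longer grows as categories are added.
CREATE TABLE Input_Values (
    rule_category CHAR(1)     NOT NULL,
    int_value     INT         NULL,
    str_value     VARCHAR(20) NULL
)

INSERT INTO Input_Values (rule_category, int_value, str_value) VALUES ('A', 50, NULL)
INSERT INTO Input_Values (rule_category, int_value, str_value) VALUES ('B', NULL, 'xyz')
INSERT INTO Input_Values (rule_category, int_value, str_value) VALUES ('C', 456, NULL)

SELECT DISTINCT
    rule_set_id AS failed_rule_set_id
FROM
    Rule_Sets RS
WHERE
    NOT EXISTS (SELECT * FROM Rules R
                JOIN Input_Values I ON I.rule_category = R.rule_category
                WHERE R.rule_id = RS.rule_id
                  AND (I.int_value = R.rule_int_value OR I.str_value = R.rule_str_value))
```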

Tom H.
But does this run faster than the "is equal or is null" SQL? In the real model there will be 15 columns, not 3. There are some columns where there are few values and this is useful but there are other columns where the values are probably unique by rule. I'll have to run some models....
jmucchiello
Actually this doesn't work. It returns 2,3. I want 1,4 to be returned as in my example. I still need to wrap it in a "select ... not in ()"
jmucchiello
I was going by your last statement that you wanted a list of all of the rules for which the data fails to pass. If your database is properly indexed, I imagine most of the methods will work OK. The real problem is that your criteria are coming in as many separate variables instead of a set.
Tom H.
A: 
SELECT * FROM RULES
 WHERE (A = :a OR A IS NULL)
   AND (B = :b OR B IS NULL)
   AND (C = :c OR C IS NULL);

Depending on your RDBMS, this might or might not be more efficient, though not by much:

SELECT * FROM RULES
 WHERE coalesce(A, :a) = :a
   AND coalesce(B, :b) = :b 
   AND coalesce(C, :c) = :c ;

In MySQL (your RDBMS may do this differently), this query allows an index scan rather than a ref_or_null scan, if there is an applicable index. If the index covers all columns, it allows the entire index to be used (and indeed, if the index covers all columns, the index effectively is the table).

With your query, a ref_or_null access is done rather than an index access, and only the first column in a multi-column index is used. With ref_or_null, MySQL has to search the index for matches, then search again for nulls. So we use the index twice, but never use the whole index.
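A covering index for the coalesce version might look like this (a sketch; the index name is made up):

```sql
-- Sketch: a composite index covering all three columns, so the coalesce
-- query can, in principle, be answered from the index alone.
CREATE INDEX idx_rules_abc ON RULES (A, B, C);
```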

But with coalesce, you have the overhead of executing the coalesce function on each column value. Which is faster probably depends on how many rules you have, how many columns in each row, and the index used, if any.

Whether it's more readable is a matter of opinion.

tpdi