views:

496

answers:

3

I don't do a lot of SQL,and most of the time, I'm doing CRUD operations. Occasionally I'll get something a bit more complicated. So, this question may be a newbie question, but I'm ready. I've just been trying to figure this out for hours, and it's been no use.

So, Imagine the following table structure:

> | ID | Col1 | Col2 | Col3 | .. | Col8 |

I want to select ID and a calculated column. The calculated column has a range of 0 - 8 and it contains the number of matches to the query. I also want to restrict the result set to only include rows that have a certain number of matches.

So, from this sample data:

> | 1 | 'a' | 'b' | 1 | 2 |  
> | 2 | 'b' | 'c' | 1 | 2 |  
> | 3 | 'b' | 'c' | 4 | 5 |  
> | 4 | 'x' | 'x' | 9 | 9 |

I want to query on Col1 = 'a' OR Col2 = 'c' OR Col3 = 1 OR Col4 = 5 where the calculated result > 1 and have the result set look like:

> | ID | Cal |
> | 1  |  2  |
> | 2  |  2  |
> | 3  |  2  |

I'm using T-SQL and SQL Server 2005, if it matters, and I can't change the DB Schema.

I'd also prefer to keep it as one self-contained query and not have to create a stored procedure or temporary table.

+2  A: 

Here's a solution that leverages the fact that a boolean comparison returns the integers 1 or 0:

SELECT * FROM (
  SELECT ID, (Col1='a') + (Col2='c') + (Col3=1) + (Col4=5) AS calculated
  FROM MyTable
) q
WHERE calculated > 1;

Note that you have to parenthesize the boolean comparisons because + has higher precedence than =. Also, you have to put it all in a subquery because you normally can't use a column alias in a WHERE clause of the same query.

It might seem like you should also use a WHERE clause in the subquery to restrict its rows, but in all likelihood you're going to end up with a full table scan anyway so it's probably not a big win. On the other hand, if you expect that such a restriction would greatly reduce the number of rows in the subquery result, then it'd be worthwhile.


Re Quassnoi's comment, if you can't treat boolean expressions as integer values, there should be a way to map boolean conditions to integers, even if it's a bit verbose. For example:

SELECT * FROM (
  SELECT ID, 
      CASE WHEN Col1='a' THEN 1 ELSE 0 END
    + CASE WHEN Col2='c' THEN 1 ELSE 0 END 
    + CASE WHEN Col3=1   THEN 1 ELSE 0 END
    + CASE WHEN Col4=5   THEN 1 ELSE 0 END AS calculated
  FROM MyTable
) q
WHERE calculated > 1;
Bill Karwin
Alternatively, you could avoid the subquery at the cost of repeating the same `(Col1='a') + (Col2='c') + ...` in the `WHERE` clause, which SQL Server should optimize away (making the longer query the only cost)
VoteyDisciple
This will work in `MySQL` but not in `SQL Server`, it cannot cast booleans to integers.
Quassnoi
@VoteyDisciple: Yes, I was thinking of the same point and I was adding the last paragraph in the answer above as you were writing your comment.
Bill Karwin
@Quassnoi: Really? How inconvenient. Well one can use `CASE` or something. I'll edit the answer to show an example.
Bill Karwin
I get the error Incorrect syntax near '=' and it references the line with the boolean values. What am I missing?
ahsteele
In SQL Server, I couldn't get this to execute - "Incorrect syntax near the keyword 'WHERE'" - I'm guessing it doesn't like the column alias
Thanks to Quassnoi for adding a table alias for my derived tables. That should help.
Bill Karwin
+1  A: 

This query is more index friendly:

SELECT  id, SUM(match)
FROM    (
        SELECT  id, 1 AS match
        FROM    mytable
        WHERE   col1 = 'a'
        UNION ALL
        SELECT  id, 1 AS match
        FROM    mytable
        WHERE   col2 = 'c'
        UNION ALL
        SELECT  id, 1 AS match
        FROM    mytable
        WHERE   col3 = 1
        UNION ALL
        SELECT  id, 1 AS match
        FROM    mytable
        WHERE   col4 = 5
        ) q
GROUP BY
        id
HAVING  SUM(match) > 1

This will only be efficient if all the columns you are searching for are, first, indexed and, second, have high cardinality (many distinct values).

See this article in my blog for performance details:

Quassnoi
The problem with this is that it means you have to do index seeks for every item you filter on and it assumes that id is a clustered PK. If one of these isn't indexed, you end up doing a table scan plus index seeks (worst case 4 table scans). This is going to be suboptimal, where a single table scan would do.
Strommy
`@Strommy`: you're right, and that's exactly what the post says. But if the columns are selective, this solution can be optimal.
Quassnoi
@Quassnoi: I agree with you. Both work and yours has the possibility of being much faster assuming index coverage. There's more than one way to skin a cat, I guess...
Strommy
+4  A: 

This answer will work with SQL 2005, using a CTE to clean up the derived table a little.

WITH Matches AS
(
    SELECT ID, CASE WHEN Col1 = 'a' THEN 1 ELSE 0 END + 
       CASE WHEN Col2 = 'c' THEN 1 ELSE 0 END +
       CASE WHEN Col3 = 1  THEN 1 ELSE 0 END +
       CASE WHEN Col4 = 5  THEN 1 ELSE 0 END AS Result
    FROM Table1
    WHERE Col1 = 'a' OR Col2 = 'c' OR Col3 = 1 OR Col4 = 5 
)
SELECT ID, Result
FROM Matches
WHERE Result > 1
Strommy
This works great! Thanks. Quassnoi's solution worked also.