Using a numbers table or a number-generating CTE, select 0 through 2^n - 1. Using the 1-bit positions in these numbers to indicate the presence or absence of the corresponding members of the set, and eliminating the numbers that don't have exactly the right count of 1-bits, you should be able to return a result set with all the combinations you desire.
WITH Nums (Num) AS (
   -- candidate bit masks: every number from 0 through 2^@n - 1
   SELECT Num
   FROM Numbers
   WHERE Num BETWEEN 0 AND POWER(2, @n) - 1
), BaseSet AS (
   -- give each member of the set its own power-of-2 indicator bit
   SELECT ind = Power(2, Row_Number() OVER (ORDER BY Value) - 1), *
   FROM @set
), Combos AS (
   -- a member belongs to combination N.Num when its bit is set in N.Num;
   -- Cnt is how many members that number selects
   SELECT
      ComboID = N.Num,
      S.Value,
      Cnt = Count(*) OVER (PARTITION BY N.Num)
   FROM
      Nums N
      INNER JOIN BaseSet S ON N.Num & S.ind <> 0
)
SELECT
   ComboID,
   Value
FROM Combos
WHERE Cnt = @k -- keep only combinations of exactly @k members
ORDER BY ComboID, Value;
Update
Ok, I tweaked the query to give correct results. I had my @n and @k mixed up in the first CTE. Other than that and a missing *, the query is unchanged. It performs pretty well, but I thought of a way to optimize it, cribbing from the Nifty Parallel Bit Count to compute the number of items taken at a time ahead of the join. This performs 3 to 3.5 times faster (both CPU and duration):
WITH Nums AS (
   -- parallel bit count, step 1: add adjacent bits in pairs
   SELECT Num, P1 = (Num & 0x55555555) + ((Num / 2) & 0x55555555)
   FROM Numbers
   WHERE Num BETWEEN 0 AND POWER(2, @n) - 1
), Nums2 AS (
   -- step 2: add the pair counts into nibble counts
   SELECT Num, P2 = (P1 & 0x33333333) + ((P1 / 4) & 0x33333333)
   FROM Nums
), Nums3 AS (
   -- step 3: add the nibble counts into byte counts
   SELECT Num, P3 = (P2 & 0x0f0f0f0f) + ((P2 / 16) & 0x0f0f0f0f)
   FROM Nums2
), BaseSet AS (
   SELECT ind = Power(2, Row_Number() OVER (ORDER BY Value) - 1), *
   FROM @set
)
SELECT
   ComboID = N.Num,
   S.Value
FROM
   Nums3 N
   INNER JOIN BaseSet S ON N.Num & S.ind <> 0
WHERE P3 % 255 = @k -- P3 % 255 adds up the byte counts: the total 1-bits in Num
ORDER BY ComboID, Value;
I went and read the bit-counting page and think that this could perform better if I don't do the % 255 but go all the way with bit arithmetic. When I get a chance I'll try that and see how it stacks up.
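If anyone wants to beat me to it, the idea is simply to extend the same masking pattern two more steps and compare the final count directly instead of using % 255; something like this untested sketch (the Nums4/Nums5 and P4/P5 names are just placeholders of mine):
WITH Nums AS (
   SELECT Num, P1 = (Num & 0x55555555) + ((Num / 2) & 0x55555555)
   FROM Numbers
   WHERE Num BETWEEN 0 AND POWER(2, @n) - 1
), Nums2 AS (
   SELECT Num, P2 = (P1 & 0x33333333) + ((P1 / 4) & 0x33333333)
   FROM Nums
), Nums3 AS (
   SELECT Num, P3 = (P2 & 0x0f0f0f0f) + ((P2 / 16) & 0x0f0f0f0f)
   FROM Nums2
), Nums4 AS (
   -- add the byte counts into 16-bit counts
   SELECT Num, P4 = (P3 & 0x00ff00ff) + ((P3 / 256) & 0x00ff00ff)
   FROM Nums3
), Nums5 AS (
   -- add the two halves: P5 is the total number of 1-bits in Num
   SELECT Num, P5 = (P4 & 0x0000ffff) + (P4 / 65536)
   FROM Nums4
), BaseSet AS (
   SELECT ind = Power(2, Row_Number() OVER (ORDER BY Value) - 1), *
   FROM @set
)
SELECT
   ComboID = N.Num,
   S.Value
FROM
   Nums5 N
   INNER JOIN BaseSet S ON N.Num & S.ind <> 0
WHERE P5 = @k
ORDER BY ComboID, Value;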
My performance claims are based on the queries run without the ORDER BY clause. For clarity, what this code is doing is counting the number of 1-bits set in Num from the Numbers table. That's because the number is being used as a sort of indexer to choose which elements of the set are in the current combination, so the count of 1-bits equals the number of elements selected.
I hope you like it!
May I also suggest this change to your Factorial UDF:
ALTER FUNCTION dbo.Factorial (
   @x bigint
)
RETURNS bigint
AS
BEGIN
   IF @x <= 1 RETURN 1
   RETURN @x * dbo.Factorial(@x - 1)
END
Now you can calculate much larger sets of combinations, plus it's more efficient.
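As a quick sanity check of the bigint version (20! is the largest factorial that still fits in a bigint):
SELECT dbo.Factorial(20);   -- 2432902008176640000, just under the bigint limit
-- SELECT dbo.Factorial(21) would overflow bigint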
Just for fun here's what I used for my performance testing with big sets:
DECLARE
@k int,
@n int;
DECLARE @set TABLE (value varchar(24));
INSERT @set VALUES ('A'),('B'),('C'),('D'),('E'),('F'),('G'),('H'),('I'),('J'),('K'),('L'),('M'),('N'),('O'),('P'),('Q'); --,('R'),('S');
SET @n = @@RowCount;
SET @k = 5;
DECLARE @combinations bigint = dbo.Factorial(@n) / (dbo.Factorial(@k) * dbo.Factorial(@n - @k));
SELECT CAST(@combinations as varchar(max)) + ' combinations', MaxNumUsedFromNumbersTable = POWER(2, @n);
Note that you could use an on-the-fly numbers table, but it may not perform as well. I haven't tried it.
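If you do want one on the fly, a stacked CTE along these lines should work (just a sketch; the L0 through L5 names are arbitrary), producing 0 through 2^@n - 1 without a permanent table:
WITH
L0 AS (SELECT c = 1 UNION ALL SELECT 1),          -- 2 rows
L1 AS (SELECT c = 1 FROM L0 A CROSS JOIN L0 B),   -- 4 rows
L2 AS (SELECT c = 1 FROM L1 A CROSS JOIN L1 B),   -- 16 rows
L3 AS (SELECT c = 1 FROM L2 A CROSS JOIN L2 B),   -- 256 rows
L4 AS (SELECT c = 1 FROM L3 A CROSS JOIN L3 B),   -- 65,536 rows
L5 AS (SELECT c = 1 FROM L4 A CROSS JOIN L4 B),   -- ~4.3 billion rows available
Numbers AS (
   SELECT TOP (POWER(2, @n))
      Num = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1
   FROM L5
)
SELECT Num FROM Numbers;  -- or splice these CTEs into the combination queries above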
Update 2
I looked at your query and found that it is not correct. For example, using my test data, set 1 is the same as set 18. It looks like your query takes a sliding stripe that wraps around: each set is always 5 adjacent members, looking something like this:
1 ABCDE
2 ABCD Q
3 ABC PQ
4 AB OPQ
5 A NOPQ
6 MNOPQ
7 LMNOP
8 KLMNO
9 JKLMN
10 IJKLM
11 HIJKL
12 GHIJK
13 FGHIJ
14 EFGHI
15 DEFGH
16 CDEFG
17 BCDEF
18 ABCDE
19 ABCD Q
Compare that with the pattern from my queries:
31 ABCDE
47 ABCD F
55 ABC EF
59 AB DEF
61 A CDEF
62 BCDEF
79 ABCD G
87 ABC E G
91 AB DE G
93 A CDE G
94 BCDE G
103 ABC FG
107 AB D FG
109 A CD FG
110 BCD FG
115 AB EFG
117 A C EFG
118 BC EFG
121 A DEFG
Just to drive the bit-pattern -> combination mapping home for anyone interested: 31 in binary is 11111 and the pattern is ABCDE; 121 in binary is 1111001 and the pattern is A__DEFG (the lowest-order bit maps to the first member, so the bits read right-to-left against the letters).
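And if you'd rather let SQL Server do the decoding, here's a throwaway illustration (the hard-coded values are just for this example):
DECLARE @Num int = 121;   -- binary 1111001
SELECT v.Value
FROM (VALUES
   ('A', 1), ('B', 2), ('C', 4), ('D', 8),
   ('E', 16), ('F', 32), ('G', 64)
) AS v(Value, ind)
WHERE @Num & v.ind <> 0   -- keep the members whose indicator bit is set
ORDER BY v.ind;           -- returns A, D, E, F, G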
For the record, this technique of using the bit pattern of integers to select members of a set is what I've coined the "Vertical Cross Join." It effectively results in the cross join of multiple sets of data, where the number of sets & cross joins is arbitrary. Here, the number of sets is the number of items taken at a time.
That is, actually cross joining would look something like this:
SELECT
   A.Value,
   B.Value,
   C.Value
FROM
   @Set A
   CROSS JOIN @Set B
   CROSS JOIN @Set C
WHERE
   A.Value = 'A'
   AND B.Value = 'B'
   AND C.Value = 'C'
But the queries above cross join as many times as necessary with only one join. The results are unpivoted compared to actual cross joins, sure, but that's a minor matter.
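If you do need one row per combination, you can re-pivot the output easily enough; for example, replacing the final SELECT of the first query with something like this sketch (FOR XML PATH string concatenation over the same Combos CTE):
SELECT DISTINCT
   C.ComboID,
   Members = STUFF((
      SELECT ',' + C2.Value
      FROM Combos C2
      WHERE C2.ComboID = C.ComboID
      ORDER BY C2.Value
      FOR XML PATH('')
   ), 1, 1, '')
FROM Combos C
WHERE C.Cnt = @k
ORDER BY C.ComboID;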
Update 3
Peter showed that my "vertical cross join" doesn't perform as well as simply writing dynamic SQL to actually do the CROSS JOINs it avoids. At the trivial cost of some extra reads, his solution performs between 10 and 17 times better. The performance of his query degrades faster than mine as the amount of work increases, but not fast enough to stop anyone from using it.
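For anyone who hasn't seen it, the dynamic SQL approach looks roughly like this sketch (my own paraphrase of the idea, not Peter's actual query; dbo.SetItems stands in for a permanent copy of the set):
DECLARE @k int = 3, @i int, @sql nvarchar(max);

-- select list: T1.Value, T2.Value, ..., Tk.Value
SET @sql = N'SELECT T1.Value';
SET @i = 2;
WHILE @i <= @k
BEGIN
   SET @sql += N', T' + CAST(@i AS nvarchar(10)) + N'.Value';
   SET @i += 1;
END;

-- k - 1 CROSS JOINs against the same set
SET @sql += N' FROM dbo.SetItems T1';
SET @i = 2;
WHILE @i <= @k
BEGIN
   SET @sql += N' CROSS JOIN dbo.SetItems T' + CAST(@i AS nvarchar(10));
   SET @i += 1;
END;

-- keep only ascending tuples so each combination appears exactly once
SET @sql += N' WHERE 1 = 1';
SET @i = 2;
WHILE @i <= @k
BEGIN
   SET @sql += N' AND T' + CAST(@i - 1 AS nvarchar(10)) + N'.Value < T'
             + CAST(@i AS nvarchar(10)) + N'.Value';
   SET @i += 1;
END;

EXEC sp_executesql @sql;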
The second set of numbers in each table below is the factor relative to the first row, just to show how it scales.
Emtucifor
Items CPU Writes Reads Duration | CPU Writes Reads Duration
----- ------ ------ ------- -------- | ----- ------ ------ --------
17•5 7344 0 3861 8531 |
18•9 17141 0 7748 18536 | 2.3 2.0 2.2
20•10 76657 0 34078 84614 | 10.4 8.8 9.9
21•11 163859 0 73426 176969 | 22.3 19.0 20.7
21•20 142172 0 71198 154441 | 19.4 18.4 18.1
Peter
Items CPU Writes Reads Duration | CPU Writes Reads Duration
----- ------ ------ ------- -------- | ----- ------ ------ --------
17•5 422 70 10263 794 |
18•9 6046 980 219180 11148 | 14.3 14.0 21.4 14.0
20•10 24422 4126 901172 46106 | 57.9 58.9 87.8 58.1
21•11 58266 8560 2295116 104210 | 138.1 122.3 223.6 131.3
21•20 51391 5 6291273 55169 | 121.8 0.1 613.0 69.5
Extrapolating, eventually my query will be cheaper overall (it already is in reads), but not for a long time. And to use 21 items in the set already requires a numbers table going up to 2097152...
I love single-query solutions to problems like this, but if you're looking for the best performance, an actual cross join is best, unless you start dealing with seriously huge numbers of combinations. But what does anyone want with hundreds of thousands or even millions of rows? Even the growing number of reads doesn't seem like too much of a problem, though 6 million is a lot and it's getting bigger fast...
Anyway. Dynamic SQL wins. I still had a beautiful query. :)