views:

165

answers:

2

I have a need to generate all combinations of size @k in a given set of size @n. Can someone please review the following SQL and determine first if the following logic is returning the expected results. And secondly, is there a better way?


/*CREATE FUNCTION dbo.Factorial ( @x int ) 
RETURNS int 
AS
BEGIN
    DECLARE @value int

    IF @x <= 1
        SET @value = 1
    ELSE
        SET @value = @x * dbo.Factorial( @x - 1 )

    RETURN @value
END
GO*/
SET NOCOUNT ON
DECLARE @k int = 5, @n int
DECLARE @set table ( [value] varchar(24) )
DECLARE @com table ( [index] int )

INSERT @set VALUES ('1'),('2'),('3'),('4'),('5'),('6')

SELECT @n = COUNT(*) FROM @set

DECLARE @combinations int = dbo.Factorial(@n) / (dbo.Factorial(@k) * dbo.Factorial(@n - @k))

PRINT CAST(@combinations as varchar(max)) + ' combinations'

DECLARE @index int = 1

WHILE @index <= @combinations
BEGIN
    INSERT @com VALUES (@index)
    SET @index = @index + 1
END

;WITH [set] as (
    SELECT 
        [value], 
        ROW_NUMBER() OVER ( ORDER BY [value] ) as [index]
    FROM @set
)
SELECT 
    [values].[value], 
    [index].[index] as [combination]
FROM [set] [values]
CROSS JOIN @com [index]
WHERE ([index].[index] + [values].[index] - 1) % (@n) BETWEEN 1 AND @k
ORDER BY
    [index].[index]


+6  A: 

Using a numbers table or number-generating CTE, select 0 through 2^n - 1. Using the 1-bit positions in these numbers to indicate the presence or absence of the relative members in the combination, and eliminating those that don't have the correct number of values, you should be able to return a result set with all the combinations you desire.

WITH Nums (Num) AS (
   SELECT Num
   FROM Numbers
   WHERE Num BETWEEN 0 AND POWER(2, @n) - 1
), BaseSet AS (
   SELECT ind = Power(2, Row_Number() OVER (ORDER BY Value) - 1), *
   FROM @set
), Combos AS (
   SELECT
      ComboID = N.Num,
      S.Value,
      Cnt = Count(*) OVER (PARTITION BY N.Num)
   FROM
      Nums N
      INNER JOIN BaseSet S ON N.Num & S.ind <> 0
)
SELECT
   ComboID,
   Value
FROM Combos
WHERE Cnt = @k
ORDER BY ComboID, Value;

Update

Ok, I tweaked the query to give correct results. I had my @n and @k mixed up in the first CTE. Other than that and a missing *, the query is unchanged. It performs pretty well, but I noticed thought of a way to optimize it, cribbing from the Nifty Parallel Bit Count to get the right number of items taken at a time ahead of time. This performs 3 to 3.5 times faster (both CPU and time):

WITH Nums AS (
   SELECT Num, P1 = (Num & 0x55555555) + ((Num / 2) & 0x55555555)
   FROM Numbers
   WHERE Num BETWEEN 0 AND POWER(2, @n) - 1
), Nums2 AS (
   SELECT Num, P2 = (P1 & 0x33333333) + ((P1 / 4) & 0x33333333)
   FROM Nums
), Nums3 AS (
   SELECT Num, P3 = (P2 & 0x0f0f0f0f) + ((P2 / 16) & 0x0f0f0f0f)
   FROM Nums2
), BaseSet AS (
   SELECT ind = Power(2, Row_Number() OVER (ORDER BY Value) - 1), *
   FROM @set
)
SELECT
   ComboID = N.Num,
   S.Value
FROM
   Nums3 N
   INNER JOIN BaseSet S ON N.Num & S.ind <> 0
WHERE P3 % 255 = @k
ORDER BY ComboID, Value;

I went and read the bit-counting page and think that this could perform better if I don't do the % 255 but go all the way with bit arithmetic. When I get a chance I'll try that and see how it stacks up.

My performance claims are based on the queries run without the ORDER BY clause. For clarity, what this code is doing is counting the number of set 1-bits in Num from the Numbers table. That's because the number is being used as a sort of indexer to choose which elements of the set are in the current combination, so the number of 1-bits will be the same.

I hope you like it!

May I also suggest this change to your Factorial UDF:

ALTER FUNCTION dbo.Factorial (
   @x bigint
)
RETURNS bigint
AS
BEGIN
   IF @x <= 1 RETURN 1
   RETURN @x * dbo.Factorial(@x - 1)
END

Now you can calculate much larger sets of combinations, plus it's more efficient.

Just for fun here's what I used for my performance testing with big sets:

DECLARE
   @k int,
   @n int;

DECLARE @set TABLE (value varchar(24));
INSERT @set VALUES ('A'),('B'),('C'),('D'),('E'),('F'),('G'),('H'),('I'),('J'),('K'),('L'),('M'),('N'),('O'),('P'),('Q'); --,('R'),('S');
SET @n = @@RowCount;
SET @k = 5;

DECLARE @combinations bigint = dbo.Factorial(@n) / (dbo.Factorial(@k) * dbo.Factorial(@n - @k));
SELECT CAST(@combinations as varchar(max)) + ' combinations', MaxNumUsedFromNumbersTable = POWER(2, @n);

Note that you could use an on-the-fly numbers table, but it may not perform as well. I haven't tried it.

Update 2

I looked at your query and found that it is not correct. For example, using my test data, set 1 is the same as set 18. It looks like your query takes a sliding stripe that wraps around: each set is always 5 adjacent members, looking something like this:

 1 ABCDE            
 2 ABCD            Q
 3 ABC            PQ
 4 AB            OPQ
 5 A            NOPQ
 6             MNOPQ
 7            LMNOP 
 8           KLMNO  
 9          JKLMN   
10         IJKLM    
11        HIJKL     
12       GHIJK      
13      FGHIJ       
14     EFGHI        
15    DEFGH         
16   CDEFG          
17  BCDEF           
18 ABCDE            
19 ABCD            Q

Comparing the pattern from my queries:

 31 ABCDE  
 47 ABCD F 
 55 ABC EF 
 59 AB DEF 
 61 A CDEF 
 62  BCDEF 
 79 ABCD  G
 87 ABC E G
 91 AB DE G
 93 A CDE G
 94  BCDE G
103 ABC  FG
107 AB D FG
109 A CD FG
110  BCD FG
115 AB  EFG
117 A C EFG
118  BC EFG
121 A  DEFG

Just to drive the bit-pattern -> index of combination thing home for anyone interested, notice that 31 in binary = 11111 and the pattern is ABCDE. 121 in binary is 1111001 and the pattern is A__DEFG (backwards mapped).

For the record, this technique of using the bit pattern of integers to select members of a set is what I've coined the "Vertical Cross Join." It effectively results in the cross join of multiple sets of data, where the number of sets & cross joins is arbitrary. Here, the number of sets is the number of items taken at a time.

That is, actually cross joining would look something like this:

SELECT
   A.Value,
   B.Value,
   C.Value
FROM
   @Set A
   CROSS JOIN @Set B
   CROSS JOIN @Set C
WHERE
   A.Value = 'A'
   AND B.Value = 'B'
   AND C.Value = 'C'

But the queries above cross join as many times as necessary with only one join. The results are unpivoted compared to actual cross joins, sure, but that's a minor matter.

Update 3

Peter showed that my "vertical cross join" doesn't perform as well as simply writing dynamic SQL to actually do the CROSS JOINs it avoids. At the trivial cost of a few more reads, his solution has metrics between 10 and 17 times better. The performance of his query decreases faster than mine as the amount of work increases, but not fast enough to stop anyone from using it.

The second set of numbers below is the factor as divided by the first row in the table, just to show how it scales.

Emtucifor

Items  CPU   Writes  Reads  Duration |  CPU  Writes  Reads Duration
----- ------ ------ ------- -------- | ----- ------ ------ --------
17•5    7344     0     3861    8531  |
18•9   17141     0     7748   18536  |   2.3          2.0      2.2
20•10  76657     0    34078   84614  |  10.4          8.8      9.9
21•11 163859     0    73426  176969  |  22.3         19.0     20.7
21•20 142172     0    71198  154441  |  19.4         18.4     18.1

Peter

Items  CPU   Writes  Reads  Duration |  CPU  Writes  Reads Duration
----- ------ ------ ------- -------- | ----- ------ ------ --------
17•5     422    70    10263     794  | 
18•9    6046   980   219180   11148  |  14.3   14.0   21.4    14.0
20•10  24422  4126   901172   46106  |  57.9   58.9   87.8    58.1
21•11  58266  8560  2295116  104210  | 138.1  122.3  223.6   131.3
21•20  51391     5  6291273   55169  | 121.8    0.1  613.0    69.5

Extrapolating, eventually my query will be cheaper (though it is from the start in reads), but not for a long time. To use 21 items in the set already requires a numbers table going up to 2097152...

I love single-query solutions to problems like this, but if you're looking for the best performance, an actual cross-join is best, unless you start dealing with seriously huge numbers of combination. But what does anyone want with hundreds of thousands or even millions of rows? Even the growing number of reads don't seem too much of a problem, though 6 million is a lot and it's getting bigger fast...

Anyway. Dynamic SQL wins. I still had a beautiful query. :)

Emtucifor
This query does not return any results using the Nums CTE `WITH Nums (Num) AS (SELECT 0 UNION ALL SELECT Nums.Num + 1 FROM Nums WHERE Nums.Num < POWER(2, @k) - 1)`. I tried changing the `WHERE Cnt = @n` to `WHERE Cnt = @k` and it incorrectly returned one set.
Dave
Thanks, Dave. I'm really busy today but I'll try to put some time into it soon to get a correct answer. I confess this was off the top of my head...
Emtucifor
I'm in the same boat, thanks for the response though!
Dave
Wow, just wow! :)
Denis Valeev
Indeed! Wow! Thanks for your thoroughness!
Dave
correct...clear...concise...oh my! clearly, my bit-twiddling abilities are not up to snuff. But a serious question, as I try to understand how this works, and hopefully apply it to other scenarios: what's the upper limit on @k and @n?
Peter
@Peter, the limit here is the size of the numbers table. Either switch to an on-the-fly numbers table (and let me know how that goes please) or you need 2^n rows out there. My testing in the past has shown that a real table with a clustered index beats the on-the-fly table. Note: see my performance comparison with your solution. I don't know how much tempdb either solution uses, but this could be a factor as well.
Emtucifor
@Peter to understand the bit twiddling make sure you follow the link in the post. Basically, this bit-counting technique turns slots of bits into counts of bits, doubling in size each time. After the first step, each 2 bits holds the count of 1-bits originally in those 2 bits (0, 1, or 2). After step to, each 4 bits holds the count for each group of 4-bits (0 through 4). Then each byte holds the value 0-8. And some fiddling and analysis (I had to do it too) will eventually show how % 255 adds these 4 values to get a final count of bits.
Emtucifor
@Peter This decimal version works the same way: `select 004008005009 % 999 = 26`, which is 4 + 8 + 5 + 9. And `select 123 % 9` = 6 which is 1 + 2 + 3. It only works if the added value is guaranteed to not exceed the modulus, like `select 234 % 9` = 0, the wrong answer.
Emtucifor
@Emtucifor: wow look at those CPU vs IO costs. Any way to parallelize your version over several cores? Or does it already generate a parallel execution plan on your machine?
Peter
It *is* a beautiful query. Anyone can write dynamic SQL. How many can change a single join into an arbitrary number of joins?
Peter
You can, now! Although I think it's the other way around, changing an arbitrary number of joins to a single join.
Emtucifor
+2  A: 

How about some dynamic SQL?

DECLARE @k int = 5, @n INT

IF OBJECT_ID('tempdb..#set') IS NOT NULL DROP TABLE #set
CREATE TABLE #set ( [value] varchar(24) )
INSERT #set VALUES ('1'),('2'),('3'),('4'),('5'),('6')
SET @n = @@ROWCOUNT

SELECT dbo.Factorial(@n) / (dbo.Factorial(@k) * dbo.Factorial(@n - @k)) AS [expected combinations]

-- let's generate some sql.
DECLARE 
  @crlf     NCHAR(2) = NCHAR(13)+NCHAR(10)
, @sql      NVARCHAR(MAX)
, @select   NVARCHAR(MAX)
, @from     NVARCHAR(MAX)
, @order    NVARCHAR(MAX)
, @in       NVARCHAR(MAX)

DECLARE @j INT = 0
WHILE @j < @k BEGIN 
  SET @j += 1
  IF @j = 1 BEGIN
    SET @select = 'SELECT'+@crlf+'  _1.value AS [1]'
    SET @from   = @crlf+'FROM #set AS _1'
    SET @order  = 'ORDER BY _1.value'
    SET @in     = '[1]'
  END 
  ELSE BEGIN
    SET @select += @crlf+', _'+CONVERT(VARCHAR,@j)+'.value AS ['+CONVERT(VARCHAR,@j)+']'
    SET @from   += @crlf+'INNER JOIN #set AS _'+CONVERT(VARCHAR,@j)+' ON _'+CONVERT(VARCHAR,@j)+'.value > _'+CONVERT(VARCHAR,@j-1)+'.value'
    SET @order  += ', _'+CONVERT(VARCHAR,@j)+'.value' 
    SET @in     += ', ['+CONVERT(VARCHAR,@j)+']'
  END
END
SET @select += @crlf+', ROW_NUMBER() OVER ('+@order+') AS combination'
SET @sql = @select + @from

-- let's see how it looks
PRINT @sql
EXEC (@sql)

-- ok, now dump pivot and dump into a table for later use
IF OBJECT_ID('tempdb..#combinations') IS NOT NULL DROP TABLE #combinations
CREATE TABLE #combinations (
  combination INT
, value VARCHAR(24)
, PRIMARY KEY (combination, value)
)

SET @sql 
= 'WITH CTE AS ('+@crlf+@sql+@crlf+')'+@crlf
+ 'INSERT #combinations (combination, value)'+@crlf
+ 'SELECT combination, value FROM CTE a'+@crlf
+ 'UNPIVOT (value FOR position IN ('+@in+')) AS b'

PRINT @sql
EXEC (@sql)

SELECT COUNT(DISTINCT combination) AS [returned combinations] FROM #combinations
SELECT * FROM #combinations

Generates the following query for @k = 5:

SELECT
  _1.value AS [1]
, _2.value AS [2]
, _3.value AS [3]
, _4.value AS [4]
, _5.value AS [5]
, ROW_NUMBER() OVER (ORDER BY _1.value, _2.value, _3.value, _4.value, _5.value) AS combination
FROM #set AS _1
INNER JOIN #set AS _2 ON _2.value > _1.value
INNER JOIN #set AS _3 ON _3.value > _2.value
INNER JOIN #set AS _4 ON _4.value > _3.value
INNER JOIN #set AS _5 ON _5.value > _4.value

Which it then unpivots and dumps into a table.

The dynamic SQL is ugly, and you can't wrap it in a UDF, but the query produced is very efficient.

Peter
I'd like to wrap into a UDF however this could come in handy though sometime. Thanks!
Dave
Put it in an SP and you can INSERT ... EXEC it.
Emtucifor
Do you need `#set`? Can't you just generate a CTE for it?
Gabe
@Gabe I think the point is that the #set table holds arbitrary things... not just numbers but perhaps people names or ingredients or any other thing that someone might want to display combinations of.
Emtucifor