tags:

views:

82

answers:

3

I am trying to keep a rolling checksum to account for order, so take the previous 'checksum' and xor it with the current one and generate a new checksum.

Name      Checksum     Rolling Checksum
------    -----------  -----------------
foo       11829231     11829231
bar       27380135     checksum(27380135 ^ 11829231) = 93291803
baz       96326587     checksum(96326587 ^ 93291803) = 67361090

How would I accomplish something like this?

(Note that the calculations are completely made up and are for illustration only)

+1  A: 
Select Name, Checksum
    , (Select T1.Checksum_Agg(Checksum)
        From Table As T1
        Where T1.Name < T.Name) As RollingChecksum
From Table As T
Order By T.Name

To do a rolling anything, you need some semblance of an order to the rows. That can be by name, an integer key, a date or whatever. In my example, I used name (even though the order in your sample data isn't alphabetical). In addition, I'm using the Checksum_Agg function in SQL.

In addition, you would ideally have a unique value on which you compare the inner and outer query. E.g., Where T1.PK < T.PK for an integer key or even string key would work well. In my solution if Name had a unique constraint, it would also work well enough.

Thomas
+1  A: 

I'm not sure about a rolling checksum, but for a rolling sum for instance, you can do this using the UPDATE command:

declare @a table (name varchar(2), value int, rollingvalue int)
insert into @a
    select 'a', 1, 0 union all select 'b', 2, 0 union all select 'c', 3, 0 

select * from @a

declare @sum int
set @sum = 0

update @a
set @sum =  rollingvalue = value + @sum 

select * from @a
Corina
+1 I think there are some caveats to observe but this can work from a very quick skim of an article here http://www.sqlservercentral.com/articles/T-SQL/68467/
Martin Smith
+2  A: 

This is basically the running total problem.

Edit:

My original claim was that is one of the few places where a cursor based solution actually performs best. The problem with the triangular self join solution is that it will repeatedly end up recalculating the same cumulative checksum as a subcalculation for the next step so is not very scalable as the work required grows exponentially with the number of rows.

Corina's answer uses the "quirky update" approach. I've adjusted it to do the check sum and in my test found that it took 3 seconds rather than 26 seconds for the cursor solution. Both produced the same results. Unfortunately however it relies on an undocumented aspect of Update behaviour. I would definitely read the discussion here before deciding whether to rely on this in production code.

There is a third possibility described here (using the CLR) which I didn't have time to test. But from the discussion here it seems to be a good possibility for calculating running total type things at display time but out performed by the cursor when the result of the calculation must be saved back.

CREATE TABLE TestTable
(
PK int identity(1,1) primary key clustered,
[Name] varchar(50),
[CheckSum] AS CHECKSUM([Name]),
RollingCheckSum1 int NULL,
RollingCheckSum2 int NULL
)


/*Insert some random records (753,571 on my machine)*/
INSERT INTO TestTable ([Name])
SELECT newid() FROM sys.objects s1, sys.objects s2, sys.objects s3

Approach One: Based on the Jeff Moden Article

DECLARE @RCS int

 UPDATE TestTable
    SET @RCS = RollingCheckSum1 = 
                                 CASE WHEN @RCS IS NULL THEN 
                                                        [CheckSum] 
                                 ELSE 
                                             CHECKSUM([CheckSum] ^ @RCS) 
                                 END
   FROM TestTable WITH (TABLOCKX)
 OPTION (MAXDOP 1)

Approach Two - Using the same cursor options as Hugo Kornelis advocates in the discussion for that article.

SET NOCOUNT ON
BEGIN TRAN

DECLARE @RCS2 INT
DECLARE @PK INT, @CheckSum INT

DECLARE curRollingCheckSum CURSOR LOCAL STATIC READ_ONLY
    FOR
    SELECT PK, [CheckSum]
    FROM         TestTable
    ORDER BY PK

   OPEN curRollingCheckSum

  FETCH NEXT FROM curRollingCheckSum
   INTO @PK, @CheckSum

  WHILE @@FETCH_STATUS = 0
  BEGIN

  SET @RCS2 = CASE WHEN @RCS2 IS NULL THEN @CheckSum ELSE CHECKSUM(@CheckSum ^ @RCS2) END


 UPDATE dbo.TestTable 
    SET RollingCheckSum2 = @RCS2
  WHERE @PK = PK

  FETCH NEXT FROM curRollingCheckSum
   INTO @PK, @CheckSum

    END

COMMIT

Test they are the same

SELECT * FROM TestTable
WHERE RollingCheckSum1<> RollingCheckSum2
Martin Smith
@Martin Smith - Wouldn't that depend greatly on what you use to match the inner and outer query? I.e., if the comparison is `Where T1.PK < T.PK` with an integer key, you would never calculate the value on the exact same set of rows (just the same set as the previous calc in addition to one more value).
Thomas
@Thomas - What I'm saying is that you can't reuse the result of the last calculation. So by the time you're up to row million you need to trek through the previous 999,999 rows again even though you've just been through 999,998 of them to calculate the value for the previous row. Is it any different to this? http://www.sqlservercentral.com/articles/T-SQL/61539/
Martin Smith
@Martin Smith- Ah. Yeah, no question. One of the reasons why rolling anything is best left to reporting engines and BI tools IMO.
Thomas
@Martin Smith - RE: Hidden RBAR. I suppose. I do not consider correlated subqueries to be RBAR. Jeff's a bit of fanatic so he might. Still, it'd be easy enough to rewrite the query using Ken's approach from your article link. I'd have to look at the execution plans to see if its faster but I can see how it might. Still, as you mentioned, you are still having to redo the calculation many times over.
Thomas
@Thomas - I found another of Jeff's articles here that looks quite interesting (but long) http://www.sqlservercentral.com/articles/T-SQL/68467/. From a very quick skim he seems to be saying that the "Quirky Update" approach could work for this and be preferable to cursors.
Martin Smith
@Martin Smith - Wow. +1 for the Nice find. Quirky update indeed.
Thomas