views:

152

answers:

2

Hi,

I'm having to return ~70,000 rows of 4 columns of INTs in a specific order and can only use very shallow caching as the data involved is highly volatile and has to be up to date. One property of the data is that it is often highly repetitive when it is in order.

I've started to look at various methods of reducing the row count in order to reduce network bandwidth and client side processing time/resources, but have not managed to find any kind of technique in T-SQL where I can 'compress' repetative rows down into a single row and a 'count' column. e.g.

prop1    prop2    prop3    prop4
--------------------------------
0        0        1        53
0        0        2        55
1        1        1        8
1        1        1        8
1        1        1        8
1        1        1        8
0        0        2        55
0        0        2        55
0        0        1        53

Into:

prop1    prop2    prop3    prop4    count
-----------------------------------------
0        0        1        53       1
0        0        2        55       1
1        1        1        8        4
0        0        2        55       2
0        0        1        53       1

I'd estimate that if this was possible, in many cases what would be a 70,000 row result set would be down to a few thousand at most.

Am I barking up the wrong tree here (is there implicit compression as part of the SQL Server protocol)?

Is there a way to do this (SQL Server 2005)?

Is there a reason I shouldn't do this?

Thanks.

+2  A: 

You can use the count function! This will require you to use the group by clause, where you tell count how to break up, or group, itself. Gropu by is used for any aggregate function in SQL.

select
    prop1,
    prop2,
    prop3,
    prop4,
    count(*) as count
from
    tbl
group by
    prop1,
    prop2,
    prop3,
    prop4,
    y,
    x
order by y, x

Update: The OP mentioned these are ordered by y and x, not part of the result set. In this case, you can still use y and x as part of the group by.

Keep in mind that order means nothing if it doesn't have ordering columns, so in this case, we have to respect that with y and x in the group by.

Eric
This may work, but a key element of his question involves "repetitive" rows of data, and it's not clear what the pattern of inserts into this table is.
David Andres
This doesn't do as required, it will group rows that are not ajacent to each other, and as I indicated as part of the question - this is highly important. Is there some way in which a windowing function could be used in conjunction with this? (I know very little about them). Thanks.
Kieran Benton
@Kieran: Ordering doesn't matter in SQL Server, unless you have columns you're ordering by. You didn't indicate that's what you were doing in the question. However, you did in the comments, and so I changed it into something that works the way you want it to.
Eric
This will work, if the y, x ordering remains consistent among the repetitious rows.
David Andres
@dandres: What? Without a column basis for ordering, SQL Server does not return rows in a consistent order. Without a consistent ordering set, you can't expect to have anything with regards to "adjacent" be meaningful.
Eric
All I mean to say is that y, x could be anything. They could be an incremental values assigned by order of transaction. For example, the first four rows of his data could have y, x = [{0, 0}, {1, 1}, {2, 2}, {3, 3}]. This will ORDER BY correctly, but won't GROUP BY all that because the last two pairs are part of the same repetition. The OP hasn't been specific about the traits of the y, x ordering.
David Andres
Eric, that can happen is the prop1...prop4 can be the same across several x,y values and you want those to be counted together. At extreme it can be that entire 70k rows have the same prop1..prop4 but different x,y, u return 70k count()=1 as opposed to a single count()=70k.
Remus Rusanu
+1  A: 

This will work, though it is painful to look at:

;WITH Ordering
AS
(
  SELECT Prop1,        
  Prop2,        
  Prop3,        
  Prop4,
  ROW_NUMBER() OVER (ORDER BY Y, X) RN
  FROM Props
)
SELECT 
  CurrentRow.Prop1, 
  CurrentRow.Prop2, 
  CurrentRow.Prop3, 
  CurrentRow.Prop4, 
  CurrentRow.RN - 
    ISNULL((SELECT TOP 1 RN FROM Ordering O3 WHERE RN < CurrentRow.RN AND (CurrentRow.Prop1 <> O3.Prop1 OR CurrentRow.Prop2 <> O3.Prop2 OR CurrentRow.Prop3 <> O3.Prop3 OR CurrentRow.Prop4 <> O3.Prop4) ORDER BY RN DESC), 0) Repetitions
FROM Ordering CurrentRow
LEFT JOIN Ordering O2 ON CurrentRow.RN + 1 = O2.RN
WHERE O2.RN IS NULL OR (CurrentRow.Prop1 <> O2.Prop1 OR CurrentRow.Prop2 <> O2.Prop2 OR CurrentRow.Prop3 <> O2.Prop3 OR CurrentRow.Prop4 <> O2.Prop4) 
ORDER BY CurrentRow.RN

The gist is the following:

  1. Enumerate each row using ROW_NUMBER OVER to get the correct order.
  2. Find the maximums per cycle by joining only when the next row has different fields or when the next row does not exist.
  3. Figure out the count of repetitions is by taking the current row number (presumed to be the max for this cycle) and subtracting from it the maximum row number of the previous cycle, if it exists.
David Andres
This returns multiple rows. You'd need a very expensive `distinct` to filter them out.
Eric
This looks to be towards what I was thinking (afraid my knowledge of windowing functions is a bit fuzzy) but I cant get this to execute, missing * in the COUNT() for a start but SQL 2005 at least doesnt seem to like both a PARTITION BY and ORDER BY clause in the OVER. Am I missing something? Cheers.
Kieran Benton
I am working on refining this query, because it does return duplicates and unfortunately the COUNT OVER statement doesn't support ordering. I'll write back with a better solution.
David Andres
the above works with limited testing
David Andres