views: 107

answers: 4

Hi folks,

I have the following SQL statement, which works perfectly fine. I was hoping to see how it could be refactored so it doesn't require the use of RANK/PARTITION ... if possible.

SELECT LogEntryId, FileId, CreatedOn, EventTypeId
FROM (SELECT a.LogEntryId, a.FileId, a.CreatedOn,  a.EventTypeId, 
        RANK() OVER (PARTITION BY ClientName ORDER BY a.CreatedOn DESC) AS MostRecentEventRank
    FROM LogEntries a
    WHERE (a.EventTypeId = 2 or a.EventTypeId = 4)) SubQuery
WHERE MostRecentEventRank = 1

What is it trying to do?

  1. Grab all the records in the table, grouped by client name and then ordered by most recently created.
  2. Filter this by only event types #2 (a connection) or #4 (a disconnection).
  3. Now, for each client name, retrieve the most recent record.

This, in effect, grabs the most recent event (for a connection or disconnection) for each unique user in the table.

I do like RANK/PARTITION, but I was hoping to see if it's possible to do this without using it.

+1  A: 

You could use an exclusive left join:

select     cur.*
from       LogEntries cur
left join  LogEntries next
on         next.ClientName = cur.ClientName
           and next.EventTypeId in (2,4)
           and next.CreatedOn > cur.CreatedOn               
where      next.ClientName is null
           and cur.EventTypeId in (2,4)

This joins the table on itself, searching for later rows in the on condition. In the where clause, you specify that no later row may exist. In that way you filter out all but the latest row per client.
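For comparison with the anti-join discussion in the comments below, the same filter can also be written with NOT EXISTS. This is just a sketch of the equivalent form, using the same table and columns as above:

select     cur.*
from       LogEntries cur
where      cur.EventTypeId in (2,4)
           -- keep a row only when no later connect/disconnect event exists for the same client
           and not exists (
               select 1
               from   LogEntries later
               where  later.ClientName = cur.ClientName
                      and later.EventTypeId in (2,4)
                      and later.CreatedOn > cur.CreatedOn)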

Andomar
Does this really work? Cute trick if it does, but how is it in terms of speed? It seems to me like this is the same as doing a sub-query per row; is that how SQL implements it?
Hogan
It really works and it's really fast, at a cost in readability. If you're into performance, see the `cross apply` solution at http://explainextended.com/2009/12/01/sql-server-selecting-records-holding-group-wise-maximum-with-ties/
Andomar
@Andomar: Shouldn't you have `AND next.eventtypeid IN (2,4)` to make sure that the comparison is on records with the correct `eventtypeid` value?
OMG Ponies
@Hogan: http://explainextended.com/2009/09/15/not-in-vs-not-exists-vs-left-join-is-null-sql-server/
OMG Ponies
@OMG Ponies: You're right, I'll edit the answer
Andomar
@OMG Ponies: that article is interesting, though I wish he wouldn't say that NOT EXISTS is not a join condition. NOT EXISTS with an outer reference is *always* a join.
Emtucifor
@Emtucifor: Take it up with the author
OMG Ponies
@Emtucifor: `NOT EXISTS` is an anti-join, so technically... not a join. ;)
Aaronaught
@OMG Ponies: It was an interested-colleague to interested-colleague comment. I'm sorry you took it the way you did.
Emtucifor
@Aaronaught: An anti-join is a join. Look in the execution plan and you'll always see a join (in fact, sometimes an INNER join or a LEFT join for semi-joins). Something that's always a join in the execution plan seems to be a join...
Emtucifor
Hi peeps. I tried the _original_ post (before it was edited) right after it was posted. It took 1 hour 30 mins. I've not looked at indexes or the query plan, but I'm sure it's my schema and lack of indexing rather than the query :) And that's on a table with 1,288,095 rows.
Pure.Krome
That's very slow for just a million rows. You could create an index on `(EventTypeId, ClientName, CreatedOn)`
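Just as a sketch of that suggestion (the index name is only illustrative):

create index ix_LogEntries_EventType_Client_CreatedOn
    on LogEntries (EventTypeId, ClientName, CreatedOn);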
Andomar
A: 

Here you go. It might be faster... not sure. Also, this assumes that ClientName + CreatedOn is unique.

;WITH MostRecent AS
(
   SELECT ClientName, Max(CreatedOn) AS CreatedOn
   FROM LogEntries
   WHERE EventTypeID IN (2,4)
   GROUP BY ClientName
)
SELECT L.LogEntryId, L.FileId, L.CreatedOn, L.EventTypeId
FROM LogEntries L
INNER JOIN MostRecent R ON L.ClientName = R.ClientName AND L.CreatedOn = R.CreatedOn

Note: I did not test this, so it might have typos.

Hogan
Just note this solution won't deal with duplicate CreatedOn dates for the same person.
Emtucifor
@Emtucifor, don't I say that in the first line?
Hogan
:) Cheers. But ClientName AND CreatedOn are not unique... which was why I originally used the PARTITION keyword.
Pure.Krome
@Hogan: My mistake! You did. I wish I could delete comments.
Emtucifor
Now I get it. You edited your answer 8 hours ago to add that first line, and I posted my comment 7 hours ago, but I'd had the page open for more than an hour, so it wasn't in my version when I commented.
Emtucifor
@Emtucifor: It is true, the first version did not have the note about uniqueness. Cheers.
Hogan
+2  A: 

Single table scan, no windowing function, a single GROUP BY, no problems with duplicate dates, performance equal to windowing functions, or even better with really large queries. (Update: I don't know how it performs compared to the TOP 1 WITH TIES / CROSS APPLY method. Since it uses a scan, it might be slower in some situations.)

SELECT
   LogEntryID = Convert(int, Substring(Packed, 9, 4)),
   FileID = Convert(int, Substring(Packed, 13, 4)),
   CreatedOn = Convert(datetime, Substring(Packed, 1, 8)),
   EventTypeID = Convert(int, Substring(Packed, 17, 4))
FROM
   (
      SELECT
         Packed = Max(
            Convert(binary(8), CreatedOn)
            + Convert(binary(4), LogEntryID)
            + Convert(binary(4), FileID)
            + Convert(binary(4), EventTypeID)
         )
      FROM LogEntries
      WHERE EventTypeID IN (2,4)
      GROUP BY ClientName
   ) X

If anyone would like to see this in action, here's a creation script:

USE tempdb
CREATE TABLE LogEntries (
   LogEntryID int not null identity(1,1),
   FileID int,
   CreatedOn datetime,
   EventTypeID int,
   ClientName varchar(30)
)

INSERT LogEntries (FileID, CreatedOn, EventTypeID, ClientName) VALUES (1, GetDate()-20, 2, 'bob')
INSERT LogEntries (FileID, CreatedOn, EventTypeID, ClientName) VALUES (1, GetDate()-19, 3, 'bob')
INSERT LogEntries (FileID, CreatedOn, EventTypeID, ClientName) VALUES (1, GetDate()-18, 4, 'bob')
INSERT LogEntries (FileID, CreatedOn, EventTypeID, ClientName) VALUES (1, GetDate()-17, 3, 'bob')
INSERT LogEntries (FileID, CreatedOn, EventTypeID, ClientName) VALUES (1, GetDate()-19.5, 2, 'anna')
INSERT LogEntries (FileID, CreatedOn, EventTypeID, ClientName) VALUES (1, GetDate()-18.5, 3, 'anna')
INSERT LogEntries (FileID, CreatedOn, EventTypeID, ClientName) VALUES (1, GetDate()-17.5, 4, 'anna')
INSERT LogEntries (FileID, CreatedOn, EventTypeID, ClientName) VALUES (1, GetDate()-16.5, 3, 'anna')

Please note that this method takes advantage of the internal byte representation of the given data types having the same ordering as the types' values. Data types like float or decimal, whose byte representations do not sort the same way as their values, will NOT work: those would require conversion to something suitable first, such as int, bigint, or character.

Also, the new Date and Time data types in SQL 2008 have different representations that won't pack correctly to use with this method. I haven't examined the Time data type yet, but for the Date data type:

DECLARE @d date
SET @d ='99990101'
SELECT Convert(binary(3), @d) -- 0x6EB837

The actual value is 0x37B86E, so it's storing them in reverse byte order (the "zero" date is 0001-01-01).
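If you did need to use a date column with this trick, one workaround (a sketch only, assuming SQL 2008+) would be to pack a big-endian day count instead of the raw bytes, since non-negative ints convert to binary in an order that matches their values:

DECLARE @d date
SET @d = '99990101'
-- days since 0001-01-01, so the binary order matches the date order
SELECT Convert(binary(4), DateDiff(day, Convert(date, '00010101'), @d)) -- 0x0037B86E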

Emtucifor
+1 Interesting technique. I wonder if the binary conversions perform better than the anti-join for a large number of rows
Andomar
In my experience, binary conversions are very cpu-cheap and always a great tradeoff over more I/O. Converting values to string is slow and expensive, but binary conversions are basically a memory copy.
Emtucifor
I'd particularly like to know how well this stacks up against Quassnoi's optimization with a recursive CTE and `CROSS APPLY`; this solution is good but will still result in a full table/index scan, whereas Quassnoi's ends up being a bit like Oracle's skip scan.
Aaronaught
@Emtucifor: I get this error: *Msg 8152, Level 16, State 17, Line 1: String or binary data would be truncated*. Do you need a table schema?
Pure.Krome
@Pure.Krome: That would be helpful. The query above works with the create table statement below it. Indexes would also be wonderful.
Emtucifor
+3  A: 

Yet another variation: select the clients, then use CROSS APPLY (.. TOP (1) ... ORDER BY ...) to get the relevant entry.

SELECT c.ClientName, r.LogEntryId, r.FileId, r.CreatedOn, r.EventTypeId
FROM (
 SELECT DISTINCT ClientName
 FROM LogEntries
 WHERE EventTypeId IN (2,4)) as c
CROSS APPLY (
   SELECT TOP (1) a.LogEntryId, a.FileId, a.CreatedOn,  a.EventTypeId
   FROM LogEntries as a
   WHERE a.ClientName = c.ClientName
   AND a.EventTypeId IN (2,4)
   ORDER BY a.CreatedOn DESC) as r;

Update

Talking about the performance of a T-SQL query without knowing the schema is nonsense. This query is perfectly optimal on a schema properly designed for its needs. Since the access is by ClientName and CreatedOn, even a simplistic schema would need to take this into consideration:

CREATE TABLE LogEntries (
   LogEntryId int identity(1,1),
   FileID int,
   CreatedOn datetime,
   EventTypeID int,
   ClientName varchar(30)
);

create clustered index cdxLogEntries on LogEntries (
    ClientName, CreatedOn DESC);
go

And let's load the table with some 2.4M rows:

declare @i int;
set @i = 0;

while @i < 1000
begin
    insert into LogEntries (FileId, CreatedOn, EventTypeId, ClientName)
    select cast(rand()*100 as int),
        dateadd(minute, -rand()*10000, getdate()),
        cast(rand() * 5 as int),
        'Client' + cast(@i as varchar(10))
        from master..spt_values;
    set @i = @i+1;
end

What time and IO do we get, with `set statistics io on; set statistics time on;`, on a warmed cache?

(410 row(s) affected)
Table 'LogEntries'. Scan count 411, logical reads 14354, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 1219 ms,  elapsed time = 1932 ms.

1.9 sec to get the data from 2.4M entries on my laptop (which is 4 years old and has 1Gb of RAM). And there is still plenty of room for improvement in the schema design. Separating ClientName into a normalized table, with a trusted foreign key from LogEntries to it, would reduce the time significantly. A proper filtered index on EventTypeID IN (2,4) would also contribute. We have not even started to explore the parallelism possibilities.
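For example, a filtered index along these lines (a sketch only; the name and included columns are illustrative, and I have not measured it against this data set):

create nonclustered index fdxLogEntries_ConnDisc on LogEntries (
    ClientName, CreatedOn DESC)
    include (LogEntryId, FileID, EventTypeID)
    where EventTypeID in (2, 4);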

This is SQL: performance is obtained on the drawing board of your schema, not in the text editor of your query.

Remus Rusanu
This is coming out as 2 table scans. Were you looking to implement the solution from Quassnoi as linked by OMG Ponies in another answer's comments?
Emtucifor
'coming out as 2 table scans'... on what schema exactly? I updated my answer, since talking about the performance of a piece of SQL text without any schema is rather silly.
Remus Rusanu
I wasn't insulting you. I simply stated the fact of how it came out for me. Since the table I was working with had no indexes, perhaps adding some will change things. My comment was supposed to start dialog, not hit you across the face with my gauntlet.
Emtucifor
@Emtucifor: I didn't feel insulted :) My normal dialogue tone is kind of emphatic and blunt and sometimes comes across rather aggressive, but I took no offense and intended none.
Remus Rusanu