The root problem: I have an application which has been running for several months now. Users have been reporting that it's been slowing down over time (so in May it was quicker than it is now). I need to get some evidence to support or refute this claim. I'm not interested in precise numbers (so I don't need to know that a login took 10 seconds), I'm interested in trends - that something which used to take x seconds now takes of the order of y seconds.

The data I have is an audit table which stores a single row each time the user carries out any activity - it includes a primary key, the user id, a date time stamp and an activity code:

create table AuditData (
    AuditRecordID int identity(1,1) not null, 
    DateTimeStamp datetime not null,
    DateOnly datetime null,
    UserID nvarchar(10) not null,
    ActivityCode int not null)

(Note: DateOnly (datetime) is the DateTimeStamp with the time stripped off, to make grouping by day for daily analysis easier - it's effectively duplicate data to make querying faster.)
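For reference, one common way to strip the time portion on older SQL Server versions is the DATEADD/DATEDIFF trick; how DateOnly actually gets populated isn't shown here, so the backfill below is only an assumption:

-- Hypothetical backfill of DateOnly (assumes it isn't already maintained at insert time)
UPDATE AuditData
SET DateOnly = DATEADD(dd, DATEDIFF(dd, 0, DateTimeStamp), 0)
WHERE DateOnly IS NULL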

Also, for the sake of ease, you can assume that the ID is assigned in date/time order, that is 1 will always be before 2 which will always be before 3 (if this isn't true I can make it so).

ActivityCode is an integer identifying the activity which took place, for instance 1 might be user logged in, 2 might be user data returned, 3 might be search results returned and so on.

Sample data for those who like that sort of thing...:

1, 01/01/2009 12:39, 01/01/2009, P123, 1
2, 01/01/2009 12:40, 01/01/2009, P123, 2
3, 01/01/2009 12:47, 01/01/2009, P123, 3
4, 01/01/2009 13:01, 01/01/2009, P123, 3

User data is returned (Activity Code 2) immediately after login (Activity Code 1), so this can be used as a rough benchmark of how long the login takes (as I said, I'm interested in trends, so as long as I'm measuring the same thing for May as for July it doesn't matter so much if this isn't the whole login process - it takes in enough of it to give a rough idea).

(Note: User data can also be returned under other circumstances so it's not a one to one mapping).

So what I'm looking to do is select the average time between login (say ActivityCode 1) and the first instance after that, for that user on that day, of user data being returned (say ActivityCode 2).

I can do this by going through the table with a cursor, getting each login instance and then doing a select to get the earliest user-data return following it for that user on that day, but that's obviously not optimal and is slow as hell.

My question is (finally) - is there a "proper" SQL way of doing this using self joins or similar, without using cursors or some similar procedural approach? I can create views and whatever to my heart's content; it doesn't have to be a single select.

I can hack something together but I'd like to make the analysis I'm doing a standard product function so would like it to be right.

+1  A: 
SELECT TheDay, AVG(TimeTaken) AvgTimeTaken
FROM (
    SELECT
        CONVERT(DATE, logins.DateTimeStamp) TheDay
        , DATEDIFF(SS, logins.DateTimeStamp,
            (SELECT TOP 1 DateTimeStamp
             FROM AuditData userinfo
             WHERE userinfo.UserID = logins.UserID
               AND userinfo.ActivityCode = 2
               AND userinfo.DateTimeStamp > logins.DateTimeStamp
             ORDER BY userinfo.DateTimeStamp)   -- ORDER BY is needed so TOP 1 really is the first following row
          ) TimeTaken
    FROM AuditData logins
    WHERE logins.ActivityCode = 1
) LogInTimes
GROUP BY TheDay

This might be dead slow in real world though.
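If it is, an index along these lines should help the correlated subquery seek straight to the next activity-code-2 row per user (the index name and exact column order are only a suggestion, not something from the question):

CREATE INDEX IX_AuditData_User_Activity_Time
    ON AuditData (UserID, ActivityCode, DateTimeStamp)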

idstam
Thanks, I'll give it a go. It would have to go it some to be slower than the current cursor style cludge I'm using though.
Jon Hopkins
+1  A: 

In Oracle this would be a cinch, because of analytic functions. In this case, LAG() makes it easy to find the matching pairs of activity codes 1 and 2 and also to calculate the trend. As you can see, things got worse on 2nd JAN and improved quite a bit on the 3rd (I'm working in seconds rather than minutes).

select DateOnly
       , elapsed_time
       , elapsed_time - lag (elapsed_time) over (order by DateOnly) as trend
from
    (
    select DateOnly
           , avg(databack_time - prior_login_time) as elapsed_time
    from
        ( select DateOnly
                 , databack_time
                 , ActivityCode
                 , lag(login_time) over (order by DateOnly, UserID, AuditRecordID, ActivityCode) as prior_login_time
          from
              (
              select a1.AuditRecordID
                     , a1.DateOnly
                     , a1.UserID
                     , a1.ActivityCode
                     , to_number(to_char(a1.DateTimeStamp, 'SSSSS')) as login_time
                     , 0 as databack_time
              from   AuditData a1
              where  a1.ActivityCode = 1
              union all
              select a2.AuditRecordID
                     , a2.DateOnly
                     , a2.UserID
                     , a2.ActivityCode
                     , 0 as login_time
                     , to_number(to_char(a2.DateTimeStamp, 'SSSSS')) as databack_time
              from   AuditData a2
              where  a2.ActivityCode = 2
              )
        )
    where ActivityCode = 2
    group by DateOnly
    )

DATEONLY  ELAPSED_TIME      TREND
--------- ------------ ----------
01-JAN-09          120
02-JAN-09          600        480
03-JAN-09          150       -450


Like I said in my comment I guess you're working in MSSQL. I don't know whether that product has any equivalent of LAG().
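For what it's worth, later versions of SQL Server (2012 onwards) did add LAG(); on such a version the same idea might look roughly like this (an untested sketch against the AuditData table above, pairing each code-2 row with the code-1 row immediately before it for the same user and day):

SELECT DateOnly,
       AVG(DATEDIFF(SS, prior_time, DateTimeStamp)) AS avg_elapsed_seconds
FROM (
    SELECT DateOnly,
           ActivityCode,
           DateTimeStamp,
           LAG(DateTimeStamp) OVER (PARTITION BY UserID, DateOnly ORDER BY AuditRecordID) AS prior_time,
           LAG(ActivityCode)  OVER (PARTITION BY UserID, DateOnly ORDER BY AuditRecordID) AS prior_code
    FROM AuditData
    WHERE ActivityCode IN (1, 2)
) t
WHERE ActivityCode = 2
  AND prior_code = 1
GROUP BY DateOnly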

APC
I am using SQL but you get an upvote for something interesting I didn't know about Oracle.
Jon Hopkins
I am still having to get used to the idea that "SQL" means MS SQL Server rather than "Structured Query Language".
APC
+1  A: 

If the assumptions are that:

  1. Users will perform various tasks in no mandated order, and
  2. That the difference between any two activities reflects the time it takes for the first of those two activities to execute,

Then why not create a table with two timestamps: the first column containing the activity start time, the second column containing the next activity's start time. Thus the difference between these two will always be the total time of the first activity. So for the logout activity, you would just have NULL for the second column.

So it would be kind of weird and interesting: for each activity (other than logging in and logging out), the timestamp would be recorded in two different rows - once for the last activity (as the time "completed") and again in a new row (as the time started). You would end up with a Jacob's ladder of sorts, but finding the data you are after would be much simpler.

In fact, to get really wacky, you could have each row have the time that the user started activity A and the activity code, and the time started activity B and the time stamp (which, as mentioned above, gets put down again for the following row). This way each row will tell you the exact difference in time for any two activities.
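A rough sketch of what such a table might look like (the table and column names here are made up for illustration, not taken from the question):

create table ActivityDurations (
    UserID nvarchar(10) not null,
    ActivityCode int not null,
    StartedAt datetime not null,
    NextStartedAt datetime null)   -- NULL for the final activity of the session (e.g. logout)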

Otherwise, you're stuck with a query that says something like

SELECT DATEDIFF(SS, row1.DateTimeStamp, row2.DateTimeStamp)

which would be pretty slow, as you have already suggested. By swallowing the redundancy, you end up just querying the difference between the two columns. You probably would have less need of knowing the user info as well, since you'd know that any row shows both activity codes, thus you can just query the average for all users on any given day and compare it to the next day (unless you are trying to find out which users are having the problem as well).
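Against a table shaped like the sketch above, the per-day averages collapse to a plain aggregate, something like:

SELECT DATEADD(dd, DATEDIFF(dd, 0, StartedAt), 0) AS TheDay,
       ActivityCode,
       AVG(DATEDIFF(SS, StartedAt, NextStartedAt)) AS AvgSeconds
FROM ActivityDurations
GROUP BY DATEADD(dd, DATEDIFF(dd, 0, StartedAt), 0), ActivityCode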

Anthony
The issue I have is that this is historical data - I can't record or use more than I have. There are also 48 possible activities and I'd potentially like to be able to measure relationships between each of them (for instance there might be requests to 3rd party systems and receipt of data from third party systems).
Jon Hopkins
+1  A: 

This is a faster way to find out: after the update, each row holds both its own datetime value and the previous row's, so you can then use DATEDIFF ( datepart , startdate , enddate ) between the two columns. I use @DammyVariable and DammyField because, as I remember, there is a problem if the @variable = Field assignment is not first in the update statement.

SELECT *, Cast(NULL AS DateTime) LastRowDateTime, Cast(NULL As INT) DammyField INTO #T FROM AuditData 
GO
CREATE CLUSTERED INDEX IX_T ON #T (AuditRecordID)
GO
DECLARE @LastRowDateTime DateTime
DECLARE @DammyVariable INT

SET @LastRowDateTime = NULL 
SET @DammyVariable = 1

UPDATE #T SET
      @DammyVariable = DammyField = @DammyVariable
    , LastRowDateTime = @LastRowDateTime
    , @LastRowDateTime = DateTimeStamp
OPTION (MAXDOP 1)
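Once LastRowDateTime is populated, the per-day averages can be read off #T with DATEDIFF; the query below is a guess at the intended final step (it assumes the activity-code-2 row directly follows the matching login row, which interleaved users could violate):

SELECT DateOnly,
       AVG(DATEDIFF(SS, LastRowDateTime, DateTimeStamp)) AS AvgSeconds
FROM #T
WHERE ActivityCode = 2
GROUP BY DateOnly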