I have a generic log table which I can attach to processes and their results. I get the average time using a process performance view:

WITH    Events
          AS (
              SELECT    PR.DATA_DT_ID
                       ,P.ProcessID
                       ,P.ProcessName
                       ,PL.GUID
                       ,PL.EventText
                       ,PL.EventTime
              FROM      MISProcess.ProcessResults AS PR
              INNER JOIN MISProcess.ProcessResultTypes AS PRT
                        ON PRT.ResultTypeID = PR.ResultTypeID
                           AND PRT.IsCompleteForTiming = 1
              INNER JOIN MISProcess.Process AS P
                        ON P.ProcessID = PR.ProcessID
              INNER JOIN MISProcess.ProcessLog AS PL
                        ON PL.BatchRunID = PR.BatchRunID
                           AND PL.ProcessID = P.ProcessID
                           AND PL.[GUID] IS NOT NULL
                           AND (
                                PL.EventText LIKE 'Process Starting:%'
                                OR PL.EventText LIKE 'Process Complete:%'
                               )
             )
SELECT  Start.DATA_DT_ID
       ,Start.ProcessName
       ,AVG(DATEDIFF(SECOND, Start.EventTime, Finish.EventTime)) AS AvgDurationSeconds
       ,COUNT(*) AS NumRuns
FROM    Events AS Start
INNER JOIN Events AS Finish
        ON Start.EventText LIKE 'Process Starting:%'
           AND Finish.EventText LIKE 'Process Complete:%'
           AND Start.DATA_DT_ID = Finish.DATA_DT_ID
           AND Start.ProcessID = Finish.ProcessID
           AND Start.GUID = Finish.GUID
GROUP BY Start.DATA_DT_ID
       ,Start.ProcessName

The GUID links a start and end entry amongst other "note"-style entries.

Now I can filter against this to exclude runs from older months, so the average performance of a process is taken only over the last 3 months, say.
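
For illustration, if the query above is saved as a view (here called MISProcess.ProcessPerformance, a hypothetical name) and DATA_DT_ID is an integer date key of the form YYYYMMDD (an assumption), the filter might look like this:

DECLARE @CutoffKey INT;
SET @CutoffKey = CAST(CONVERT(CHAR(8), DATEADD(MONTH, -3, GETDATE()), 112) AS INT);

SELECT  DATA_DT_ID
       ,ProcessName
       ,AvgDurationSeconds
       ,NumRuns
FROM    MISProcess.ProcessPerformance  -- hypothetical name for the view above
WHERE   DATA_DT_ID >= @CutoffKey       -- keep only roughly the last 3 months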

The problem comes when there are outliers due to poor performance or to debugging runs, where the process completes in 0 seconds, for example.

I'd like to eliminate any such outliers automatically.

Would the VAR() or STDEV() aggregate functions work?

+2  A: 

Aggregate functions ignore NULLs (except for COUNT(*)), so if you can convert outliers to NULL in your expression, that would help.

-- Zero-duration runs become NULL, so AVG() skips them
AVG( CASE WHEN Start.EventTime = Finish.EventTime THEN NULL
     ELSE DATEDIFF(SECOND, Start.EventTime, Finish.EventTime)
     END )
Bill Karwin
Note for any casual observers: COUNT(field_name) will also ignore NULLs.
Eric
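
For illustration, applying both points to the outer query from the question (the Events CTE is unchanged; the zero-second test and the NumRunsCounted / NumRunsTotal aliases are illustrative assumptions):

SELECT  Start.DATA_DT_ID
       ,Start.ProcessName
        -- zero-second runs become NULL, so AVG() skips them
       ,AVG(CASE WHEN DATEDIFF(SECOND, Start.EventTime, Finish.EventTime) = 0 THEN NULL
                 ELSE DATEDIFF(SECOND, Start.EventTime, Finish.EventTime)
            END) AS AvgDurationSeconds
        -- COUNT(expression) also skips NULLs, so this counts only the runs that were averaged
       ,COUNT(CASE WHEN DATEDIFF(SECOND, Start.EventTime, Finish.EventTime) = 0 THEN NULL
                   ELSE DATEDIFF(SECOND, Start.EventTime, Finish.EventTime)
              END) AS NumRunsCounted
       ,COUNT(*) AS NumRunsTotal
FROM    Events AS Start
INNER JOIN Events AS Finish
        ON Start.EventText LIKE 'Process Starting:%'
           AND Finish.EventText LIKE 'Process Complete:%'
           AND Start.DATA_DT_ID = Finish.DATA_DT_ID
           AND Start.ProcessID = Finish.ProcessID
           AND Start.GUID = Finish.GUID
GROUP BY Start.DATA_DT_ID
       ,Start.ProcessName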
A: 

Without having parsed your query in detail, my first idea is:

  • do your query into a table variable (or temp table)
  • remove outliers from the table using whatever metric you use to define outliers
  • this metric might just be removing all values below or above a fixed threshold
  • and/or first calculating the mean and stdev and then removing all entries more than x stdev away from the mean
  • then do further analysis on the cleaned table (see the sketch below)
Ben Schwehn
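
A minimal T-SQL sketch of that approach, assuming the Events CTE from the question and a cutoff of 2 standard deviations per process (the @Durations table variable, its column types, and the threshold are illustrative):

DECLARE @Durations TABLE (
    DATA_DT_ID      INT,
    ProcessName     VARCHAR(255),
    DurationSeconds INT
);

-- 1. Capture one duration row per run, re-using the Events CTE from the question
WITH    Events
          AS (
              SELECT    PR.DATA_DT_ID
                       ,P.ProcessID
                       ,P.ProcessName
                       ,PL.[GUID]
                       ,PL.EventText
                       ,PL.EventTime
              FROM      MISProcess.ProcessResults AS PR
              INNER JOIN MISProcess.ProcessResultTypes AS PRT
                        ON PRT.ResultTypeID = PR.ResultTypeID
                           AND PRT.IsCompleteForTiming = 1
              INNER JOIN MISProcess.Process AS P
                        ON P.ProcessID = PR.ProcessID
              INNER JOIN MISProcess.ProcessLog AS PL
                        ON PL.BatchRunID = PR.BatchRunID
                           AND PL.ProcessID = P.ProcessID
                           AND PL.[GUID] IS NOT NULL
                           AND (
                                PL.EventText LIKE 'Process Starting:%'
                                OR PL.EventText LIKE 'Process Complete:%'
                               )
             )
INSERT  INTO @Durations (DATA_DT_ID, ProcessName, DurationSeconds)
SELECT  Start.DATA_DT_ID
       ,Start.ProcessName
       ,DATEDIFF(SECOND, Start.EventTime, Finish.EventTime)
FROM    Events AS Start
INNER JOIN Events AS Finish
        ON Start.EventText LIKE 'Process Starting:%'
           AND Finish.EventText LIKE 'Process Complete:%'
           AND Start.DATA_DT_ID = Finish.DATA_DT_ID
           AND Start.ProcessID = Finish.ProcessID
           AND Start.[GUID] = Finish.[GUID];

-- 2. Remove outliers: rows more than 2 standard deviations from that process's mean
DELETE  D
FROM    @Durations AS D
INNER JOIN (
            SELECT  ProcessName
                   ,AVG(CAST(DurationSeconds AS FLOAT)) AS MeanDuration
                   ,STDEV(DurationSeconds)              AS StdevDuration
            FROM    @Durations
            GROUP BY ProcessName
           ) AS S
        ON S.ProcessName = D.ProcessName
WHERE   ABS(D.DurationSeconds - S.MeanDuration) > 2 * S.StdevDuration;

-- 3. Analyse the cleaned data
SELECT  DATA_DT_ID
       ,ProcessName
       ,AVG(DurationSeconds) AS AvgDurationSeconds
       ,COUNT(*)             AS NumRuns
FROM    @Durations
GROUP BY DATA_DT_ID
       ,ProcessName;

A per-process standard deviation cutoff adapts to each process's typical run time, whereas a fixed threshold would need tuning per process; this is also where STDEV() from the original question comes into play.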