My memory is failing me. I have a simple audit log table based on a trigger:

ID            int (identity, PK)
CustomerID    int               
Name          varchar(255)      
Address       varchar(255)      
AuditDateTime datetime          
AuditCode     char(1)           


It has data like this:

ID CustomerID Name    Address             AuditDateTime           AuditCode
1  123        Bob     123 Internet Way    2009-07-17 13:18:06.353 I
2  123        Bob     123 Internet Way    2009-07-17 13:19:02.117 D
3  123        Jerry   123 Internet Way    2009-07-17 13:36:03.517 I
4  123        Bob     123 My Edited Way   2009-07-17 13:36:08.050 U
5  100        Arnold  100 SkyNet Way      2009-07-17 13:36:18.607 I
6  100        Nicky   100 Star Way        2009-07-17 13:36:25.920 U
7  110        Blondie 110 Another Way     2009-07-17 13:36:42.313 I
8  113        Sally   113 Yet another Way 2009-07-17 13:36:57.627 I


What would be an efficient select statement to get all of the most current records between a start and end time? FYI: I for insert, D for delete, and U for update.

Am I missing anything in the audit table? My next step is to create an audit table that only records changes, yet still lets you extract the most recent records for a given time frame. For the life of me I cannot find this on any search engine easily. Links would work too. Thanks for the help.

+1  A: 

Another (better?) method to keep audit history is to use 'startDate' and 'endDate' columns rather than the auditDateTime and AuditCode columns. This is often the approach for tracking Type 2 changes (new versions of a row) in data warehouses.

This lets you more directly select the current rows (WHERE endDate is NULL), and you will not need to treat updates differently than inserts or deletes. You simply have three cases:

  • Insert: insert the full row with a startDate and a NULL endDate
  • Delete: set the endDate of the existing current row (the row where endDate is NULL)
  • Update: perform a Delete, then an Insert
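A minimal sketch of the three cases, assuming SQL Server syntax; the table and parameter names here are illustrative, not taken from the question:

```sql
-- Hypothetical Type 2 style audit table
create table CustomerAudit (
    ID         int identity primary key,
    CustomerID int,
    Name       varchar(255),
    Address    varchar(255),
    startDate  datetime not null,
    endDate    datetime null      -- NULL marks the current row
)

-- Insert: add a new current row
insert into CustomerAudit (CustomerID, Name, Address, startDate, endDate)
values (@CustomerID, @Name, @Address, getdate(), NULL)

-- Delete: expire the current row
update CustomerAudit
set endDate = getdate()
where CustomerID = @CustomerID and endDate is NULL

-- Update: expire the current row, then insert the new version
update CustomerAudit
set endDate = getdate()
where CustomerID = @CustomerID and endDate is NULL

insert into CustomerAudit (CustomerID, Name, Address, startDate, endDate)
values (@CustomerID, @NewName, @NewAddress, getdate(), NULL)
```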

Your select would simply be:

select * from AuditTable where endDate is NULL

Anyway, here's my query for your existing schema:

declare @from datetime
declare @to datetime

select b.* from (
  select
    customerId,
    /* take the latest I/U time; D rows still count in the HAVING below */
    max(case when auditcode in ('I', 'U') then auditdatetime end) 'auditDateTime'
  from
    AuditTable
  where
    auditdatetime between @from and @to
  group by customerId
  having 
    /* rely on "current" being defined as INSERTS > DELETES */
    sum(case when auditcode = 'I' then 1 else 0 end) > 
    sum(case when auditcode = 'D' then 1 else 0 end)
) a
cross apply (
  select top 1 customerId, name, address, auditdateTime
  from AuditTable
  where auditdatetime = a.auditdatetime and customerId = a.customerId
) b

References

A cribsheet for data warehouses, but has a good section on type 2 changes (what you want to track)

MSDN page on data warehousing

Jeff Meatball Yang
If you are looking at versioned data in general, you are 100% right. When picking the most recent row, where the topmost row is "not-expired", it is best that all versioned rows have a fromDate/toDate pair so that a SARG-able query against such an index is possible. Otherwise a self-join and double scan is generally required at a minimum (and often a table scan and a half if the query is "get all maximum versions of the entity rows from the table").
polyglot
The first reference to "use a 'startDate' and 'endDate' column rather than an auditDateTime and AuditCode column", where would I read about this?
Dr. Zim
That's a technique borrowed from data warehousing. See the references, and look for "Type 2" within the text.
Jeff Meatball Yang
TYVM. I read the article. I have been a snowflake and didn't know it. LOL. Anyway, I think I finally understand your Insert, Update, and Delete comments above to apply to the Audit table (yea it took a while). Very good idea because I could have three transactions running from it, a delete from where exists, and an insert where not exists, and an update join. (I hope I got it right). Thanks again.
Dr. Zim
+1  A: 

Another approach is to use a sub-select:

select a.ID
       , a.CustomerID 
       , a.Name
       , a.Address
       , a.AuditDateTime
       , a.AuditCode
from   myauditlogtable a
inner join
       (select s.CustomerID
               , max(s.AuditDateTime) as maxAuditDateTime
        from myauditlogtable as s 
        group by s.CustomerID) 
        as subq
on  subq.CustomerID = a.CustomerID
and subq.maxAuditDateTime = a.AuditDateTime;
nos
+2  A: 

Ok, a couple of things for audit log tables.

For most applications, we want audit tables to be extremely quick on insertion.

If the audit log is truly for diagnostics or for very irregular audits, then the quickest insertion strategy is to make the table physically ordered by insertion time.

This means putting the audit time as the first column of the clustered index, e.g.

create unique clustered index idx_mytable on mytable(AuditDateTime, ID)

This will allow for extremely efficient select queries upon AuditDateTime (O(log n)) and effectively O(1), append-only insertions.

If you wish to look up your audit table on a per CustomerID basis, then you will need to compromise.

You may add a nonclustered index upon (CustomerID, AuditDateTime), which will allow O(log n) lookup of per-customer audit history; however, the cost is the maintenance of that nonclustered index, itself O(log n) per insertion.
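For example (the index and table names here are illustrative):

```sql
-- Supports per-customer history lookups at O(log n),
-- at the cost of extra index maintenance on every insert
create nonclustered index idx_mytable_customer
on mytable (CustomerID, AuditDateTime)
```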

However, that insertion-time penalty may be preferable to the table scan (O(n) time complexity) you will pay if you don't have an index on CustomerID and this is a regularly performed query. An O(n) lookup locks the table against the writing process for its duration, so it is sometimes in the writers' interest to be slightly slower on insert if that guarantees readers won't block their commits by table-scanning for lack of a good supporting index.


Addition: if you are looking to restrict to a given timeframe, the most important thing first of all is the index upon AuditDateTime. And make it clustered as you are inserting in AuditDateTime order. This is the biggest thing you can do to make your query efficient from the start.

Next, if you are looking for the most recent update for all CustomerID's within a given timespan, well thereafter a full scan of the data, restricted by insertion date, is required.

You will need to do a subquery upon your audit table, restricted to the range:

select CustomerID, max(AuditDateTime) MaxAuditDateTime 
from AuditTrail 
where AuditDateTime >= @begin and AuditDateTime <= @end
group by CustomerID

and then incorporate that into your select query proper, e.g.

select AuditTrail.* from AuditTrail
inner join 
    (select CustomerID, max(AuditDateTime) MaxAuditDateTime 
     from AuditTrail 
     where AuditDateTime >= @begin and AuditDateTime <= @end
     group by CustomerID
    ) filtration
    on filtration.CustomerID = AuditTrail.CustomerID and 
       filtration.MaxAuditDateTime = AuditTrail.AuditDateTime
polyglot
A: 

start and end time? e.g. between 1am and 3am?
or start and end datetime? e.g. 2009-07-17 13:36 to 2009-07-18 13:36?

MakkyNZ
An entire datetime start to an entire datetime end, rather than a repeating time of day. I need to select all records up to the point when I want to process them, like on the hour. They continuously change, and I want to process up to a point without stopping their process.
Dr. Zim