I have a table that maintains performance data for a system. Each record is a call made to some important method and consists of the method name, its duration and a token. Each request to the system is given a unique token, so all of the records with the same token belong to the same request, e.g.:

CallName    Duration    Token
----------- ----------- -----------
GetData     121         12345
Process     800         12345
SaveData    87          12345

GetData     97          ABCDE
Process     652         ABCDE
SaveData    101         ABCDE

I am interested in aggregate data grouped by the Token and the CallName, for example:

-- The total duration of each request, in descending order
SELECT Token, SUM(Duration) FROM Requests GROUP BY Token ORDER BY SUM(Duration) DESC

-- The average duration of each call, in descending order
SELECT CallName, AVG(Duration) FROM Requests GROUP BY CallName ORDER BY AVG(Duration) DESC

Now this table is potentially very large, and I'm only ever going to be interested in the top few records of each query, so I've implemented paging for both of these queries. The trouble is that because these queries involve aggregate functions, SQL Server ends up doing a table scan anyway.
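For example, the paged version of the first query looks something like this (simplified; it assumes SQL Server 2005 or later for ROW_NUMBER()):

-- Page 1 (rows 1-20) of requests ordered by total duration
SELECT Token, TotalDuration
FROM (
    SELECT Token,
           SUM(Duration) AS TotalDuration,
           ROW_NUMBER() OVER (ORDER BY SUM(Duration) DESC) AS RowNum
    FROM Requests
    GROUP BY Token
) AS paged
WHERE RowNum BETWEEN 1 AND 20
ORDER BY RowNum;

Even though only one page of results is returned, the aggregate still has to be computed for every token before the sort can happen.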

Surely other people have had this problem before?

What I really need here is an "index" on SUM(Duration) grouped by Token, i.e. a table where I can do things like:

SELECT Token, SumToken FROM RequestTokens ORDER BY SumToken DESC
  • Is this a really bad idea?
  • If so, is there a better way?
  • What would be the best way to do this? Would triggers on INSERT / UPDATE / DELETE work (where I update the aggregate values based on the old values and the changed data), or would I be better off manually updating my "index" when updating this table?

Triggers are the best solution I've come up with so far, but I can already see this being a deadlock / consistency nightmare! :-S
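For what it's worth, this is roughly the trigger-based approach I have in mind - just an untested sketch with made-up column types, covering only INSERT:

CREATE TABLE dbo.RequestTokens
(
    Token    varchar(20) NOT NULL PRIMARY KEY,
    SumToken bigint      NOT NULL
);

CREATE TRIGGER trg_Requests_Insert ON dbo.Requests
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- Add the new durations to tokens we've already seen...
    UPDATE rt
    SET rt.SumToken = rt.SumToken + i.AddedDuration
    FROM dbo.RequestTokens rt
    JOIN (SELECT Token, SUM(Duration) AS AddedDuration
          FROM inserted
          GROUP BY Token) i ON i.Token = rt.Token;

    -- ...and create rows for tokens we haven't.
    INSERT INTO dbo.RequestTokens (Token, SumToken)
    SELECT i.Token, SUM(i.Duration)
    FROM inserted i
    WHERE NOT EXISTS (SELECT * FROM dbo.RequestTokens rt WHERE rt.Token = i.Token)
    GROUP BY i.Token;
END;

Similar triggers would be needed for UPDATE and DELETE, which is where I suspect the consistency problems would start.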

A: 

First off, would not an index on the Token column be sufficient? That way, given the Token value, the SQL query optimizer would scan just the part of the index that contains the rows you are interested in. Make this a clustered index, and you'd get optimal performance.
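For example (untested, and assuming Requests doesn't already have a clustered index):

-- A clustered index on Token keeps each request's rows together
CREATE CLUSTERED INDEX IX_Requests_Token ON dbo.Requests (Token);

-- Given a known token, only that slice of the index needs to be read
SELECT Token, SUM(Duration) AS TotalDuration
FROM dbo.Requests
WHERE Token = 'ABCDE'
GROUP BY Token;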

Next up, how do you know which Token value you are interested in aggregating? There is no datetime (or timestamp) column listed, and the Token values appear to be randomly assigned (as opposed to some form of ascending value), so I'd guess that you know the Token value to be aggregated before you issue the query, in which case indexing should do what you want. If the values are unknown but somehow ascending, there are a number of tactics you could use to first determine the most recent X Token values, and once you have that Token (or those Tokens) you're back to the partial table scan.

Philip Kelley
An index on the token would be sufficient if I only wanted one token; however, I want to query based on aggregate values across all tokens - namely, I want the top few tokens when sorted by an aggregate value. That will normally require a table scan, as SQL Server must calculate the aggregate value for every token before it can sort them.
Kragen
+1  A: 

How about a view based on the aggregates, possibly even an indexed view? I have not done very much with indexed views, but this article talks about using them with complex aggregates like AVG(). Maybe it will get you going in the correct direction.

http://msdn.microsoft.com/en-us/library/aa933148%28SQL.80%29.aspx
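Something along these lines might work (untested sketch; as the article explains, an indexed view can't contain AVG() directly, so you store SUM() and COUNT_BIG() and derive the average at query time, and this assumes Duration is NOT NULL):

CREATE VIEW dbo.RequestTotals
WITH SCHEMABINDING
AS
    SELECT Token,
           SUM(Duration) AS TotalDuration,
           COUNT_BIG(*)  AS CallCount   -- required for indexed views with GROUP BY
    FROM dbo.Requests
    GROUP BY Token;
GO

-- The unique clustered index is what materializes the view
CREATE UNIQUE CLUSTERED INDEX IX_RequestTotals_Token
    ON dbo.RequestTotals (Token);
GO

-- Optionally index the aggregate itself so the TOP ... ORDER BY needs no sort
CREATE NONCLUSTERED INDEX IX_RequestTotals_TotalDuration
    ON dbo.RequestTotals (TotalDuration DESC);
GO

-- Top requests by total duration, without touching the base table
-- (the NOEXPAND hint is needed on non-Enterprise editions)
SELECT TOP 10 Token, TotalDuration
FROM dbo.RequestTotals WITH (NOEXPAND)
ORDER BY TotalDuration DESC;

A second view grouped by CallName, again storing SUM() and COUNT_BIG(), would cover the average-duration query by computing SUM/COUNT at select time.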

Gratzy
Thank you very much, I hadn't realised that there were substitute aggregate functions for indexed views.
Kragen
I ended up sorting this out with a combination of indexed views and tables of aggregated data maintained with triggers and scheduled jobs.
Kragen
A: 

I'll take another stab at this one, now that I understand it better. This is a not-uncommon reporting problem that leads to data warehousing solutions, like so: can you just add a second table containing pre-aggregated data? This is indeed denormalized/redundant data, but it seems clear and well defined, and it serves the needs of the business. A couple of wrinkles on this idea:

If the data is entered only once, could you modify the data entry routine to add the aggregated row at the same time? If it trickles in over time, you'd need a recurring process to "sweep things up". I avoid triggers on general principle; they might help here, but they could also tie up your system depending on usage patterns and data interrelationships.
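For instance, the entry routine could maintain the summary row in the same transaction as the detail row (a rough, untested sketch; dbo.AddCall and the column types are made up, and RequestTokens is the pre-aggregated table you described):

CREATE PROCEDURE dbo.AddCall
    @CallName varchar(50),
    @Duration int,
    @Token    varchar(20)
AS
BEGIN
    SET NOCOUNT ON;

    BEGIN TRANSACTION;

    INSERT INTO dbo.Requests (CallName, Duration, Token)
    VALUES (@CallName, @Duration, @Token);

    -- Keep the pre-aggregated row in step with the detail row
    -- (under heavy concurrency you'd want proper upsert locking here)
    UPDATE dbo.RequestTokens
    SET SumToken = SumToken + @Duration
    WHERE Token = @Token;

    IF @@ROWCOUNT = 0
        INSERT INTO dbo.RequestTokens (Token, SumToken)
        VALUES (@Token, @Duration);

    COMMIT TRANSACTION;
END;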

How up to date must the data be? Can the aggregated data be out of sync with the detailed data, and if so, for how long? You could have a SQL Agent job in place that runs every day/hour/5 minutes to scan for recent entries and update the aggregated table. (Add an indexed "last entered" column, and these updates could be quick.) The trade-off is the period of time during which your data would be off. (But that datetime column could indicate "up through when" the data is accurate. Maybe you don't make the aggregated data available after that point in time?)
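A sketch of what that job could run (untested; EnteredAt stands in for the hypothetical indexed "last entered" column, and @LastRun would come from wherever the job records its previous run):

DECLARE @LastRun datetime;
SET @LastRun = '20091201';  -- in practice, read this from a control table

-- Find only the tokens touched since the last sweep (uses the EnteredAt index)
SELECT DISTINCT Token
INTO #TouchedTokens
FROM dbo.Requests
WHERE EnteredAt >= @LastRun;

-- Recalculate the full totals for just those tokens...
UPDATE rt
SET rt.SumToken = totals.SumToken
FROM dbo.RequestTokens rt
JOIN (SELECT r.Token, SUM(r.Duration) AS SumToken
      FROM dbo.Requests r
      JOIN #TouchedTokens t ON t.Token = r.Token
      GROUP BY r.Token) totals ON totals.Token = rt.Token;

-- ...and add rows for tokens the summary table hasn't seen yet
INSERT INTO dbo.RequestTokens (Token, SumToken)
SELECT r.Token, SUM(r.Duration)
FROM dbo.Requests r
JOIN #TouchedTokens t ON t.Token = r.Token
WHERE NOT EXISTS (SELECT * FROM dbo.RequestTokens rt WHERE rt.Token = r.Token)
GROUP BY r.Token;

DROP TABLE #TouchedTokens;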

If the data does not change once it is entered (no updates, no deletes, no late-arriving rows), this could work. But if you have to handle changes over time, and those changes have to show up in the aggregated data as soon as they are made, it could be a nightmare to maintain.

Philip Kelley