views: 223
answers: 2

I have an interesting dilemma. I have a very expensive query that involves several full table scans and expensive joins, as well as calls to a scalar UDF that calculates some geospatial data.

The end result is a resultset containing the data presented to the user. However, I can't return everything I want to show the user in one call: I subdivide the original resultset into pages and return only the requested page, and I also need to take the entire original dataset and apply GROUP BYs, joins, etc. to calculate related aggregate data.

Long story short, in order to bind all of the data I need to the UI, this expensive query needs to be called about 5-6 times.

So, I started thinking about how I could calculate this expensive query once, and then each subsequent call could somehow pull against a cached result set.

I hit upon the idea of abstracting the query into a stored procedure that would take in a CacheID (Guid) as a nullable parameter.

This sproc would insert the resultset into a cache table using the cacheID to uniquely identify this specific resultset.

This allows sprocs that need to work on this resultset to pass in a cacheID from a previous query and it is a simple SELECT statement to retrieve the data (with a single WHERE clause on the cacheID).

Then a periodic SQL job flushes out the cache table.
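The pattern above can be sketched as follows. This is an illustrative mock-up using SQLite and plain Python functions in place of SQL Server stored procedures; the table, column, and function names are all invented for the example:

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "east", 10.0), (2, "west", 20.0), (3, "east", 30.0)])

# Cache table: every row of a materialized resultset is tagged with the
# GUID of the run that produced it.
conn.execute("""CREATE TABLE result_cache (
                    cache_id TEXT, id INTEGER, region TEXT, total REAL)""")

def run_expensive_query(conn, cache_id=None):
    """If cache_id is None, run the expensive query once and materialize
    it into the cache table; otherwise the caller reuses the cached rows."""
    if cache_id is None:
        cache_id = str(uuid.uuid4())
        # Stand-in for the expensive scans/joins/UDF work.
        conn.execute(
            "INSERT INTO result_cache "
            "SELECT ?, id, region, total FROM orders WHERE total > 5",
            (cache_id,))
    return cache_id

def page(conn, cache_id, limit, offset):
    # Paging is now a cheap single-predicate SELECT against the cache.
    return conn.execute(
        "SELECT id, region, total FROM result_cache "
        "WHERE cache_id = ? ORDER BY id LIMIT ? OFFSET ?",
        (cache_id, limit, offset)).fetchall()

def aggregates(conn, cache_id):
    # Aggregate passes reuse the same cached rows instead of re-running
    # the expensive query.
    return conn.execute(
        "SELECT region, SUM(total) FROM result_cache "
        "WHERE cache_id = ? GROUP BY region ORDER BY region",
        (cache_id,)).fetchall()

def flush_cache(conn, cache_id):
    # Stand-in for the periodic SQL job that empties the cache table.
    conn.execute("DELETE FROM result_cache WHERE cache_id = ?", (cache_id,))

cid = run_expensive_query(conn)      # expensive pass, runs once
first_page = page(conn, cid, 2, 0)   # subsequent calls hit the cache
totals = aggregates(conn, cid)
flush_cache(conn, cid)
```

The key property is that the GUID lets any later call (paging, aggregates) key into the same materialized resultset with one cheap WHERE clause.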

This works great and really speeds things up in zero-load testing. However, I am concerned that this technique may cause issues under load, with massive amounts of reads and writes against the cache table.

So, long story short: am I crazy, or is this a good idea?

Obviously I need to worry about lock contention and index fragmentation, but is there anything else to be concerned about?

+1  A: 

I have done that before, especially when I did not have the luxury to edit the application. I think it's a valid approach sometimes, but in general a cache/distributed cache in the application is preferred, because it reduces load on the DB and scales better.

The tricky thing with the naive "just do it in the application" solution is that you often have multiple applications interacting with the DB, which can put you in a bind if you have no application messaging bus (or something like memcached), because it can be expensive to have one cache per application.

Obviously, for your problem the ideal solution is to do the paging more cheaply and not churn through ALL the data just to get page N. But sometimes that's not possible. Keep in mind that streaming data out of the DB can be cheaper than streaming it out of the DB and back into the same DB. You could introduce a new service that is responsible for executing these long queries, and have your main application talk to the DB via that service.

Sam Saffron
Then I'd have to pipe thousands of results back to the app?
FlySwat
To elaborate, I perform a lot of SQL operations on this data, and just send the results to the app. So caching in the app would be counterproductive.
FlySwat
@FlySwat, I think introducing an intermediary for this may be worth considering. You don't want to send your DB into a fit every time someone runs a long-running report; a service in the middle gives you the ability to throttle and reduce load on the DB.
Sam Saffron
+1  A: 

Your tempdb could balloon like crazy under load, so I would watch that. It might be easier to put the expensive joins in a view and index the view than to cache the table for every user.
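For context, an indexed view in SQL Server materializes the view's rows so the expensive join/aggregation cost is paid at write time rather than on every read. SQLite has no indexed views, but the same idea can be roughly approximated with a trigger-maintained summary table; this sketch uses invented names and only handles inserts (a real indexed view also stays correct under UPDATE and DELETE):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, total REAL);

-- Summary table standing in for an indexed view: the aggregate is
-- maintained on every write, so reads are a cheap indexed lookup.
CREATE TABLE region_totals (region TEXT PRIMARY KEY, total REAL);

CREATE TRIGGER orders_after_insert AFTER INSERT ON orders
BEGIN
    INSERT OR IGNORE INTO region_totals (region, total)
    VALUES (NEW.region, 0);
    UPDATE region_totals SET total = total + NEW.total
    WHERE region = NEW.region;
END;
""")

conn.executemany("INSERT INTO orders (region, total) VALUES (?, ?)",
                 [("east", 10.0), ("west", 20.0), ("east", 30.0)])

# Reads hit the precomputed summary instead of re-aggregating orders.
totals = conn.execute(
    "SELECT region, total FROM region_totals ORDER BY region").fetchall()
```

The trade-off is the one raised in the comments below: the materialized rows must cover every combination readers will ask for, which is why a highly parameterized query can make the view enormous.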

Eric
Results are arbitrary and depend on a lot of input parameters, so the view would be enormous. Indexing a view was my first thought as well, but I realized that to cover every case the view would need millions of rows.
FlySwat
@FlySwat, millions of rows is not necessarily that bad, especially if at the end of the day you are distributing the load on the DB better and reducing it overall.
Sam Saffron