views: 443
answers: 3

My application includes a client, a load-balanced web tier, a load-balanced application tier, and a database tier. The web tier exposes services to clients and forwards calls on to the application tier. The application tier then executes queries against the database (using NHibernate) and returns the results.

Data is mostly read, but writes occur fairly frequently, particularly as new data enters the system. Much more often than not, data is aggregated and those aggregations are returned to the client - not the original data.

Typically, users will be interested in the aggregation of recent data - say, from the past week. Thus, to me it makes sense to introduce a cache that includes all data from the past 7 days. I cannot just cache entities as and when they are loaded because I need to aggregate over a range of entities, and that range is dictated by the client, along with other complications, such as filters. I need to know whether - for a given range of time - all data within that range is in the cache or not.

In my ideal fantasy world, my services would not have to change at all:

public AggregationResults DoIt(DateTime starting, DateTime ending, Filter filter)
{
    // execute HQL/criteria call and have it automatically use the cache where possible
}

There would be a separate filtering layer that would hook into NHibernate and intelligently and transparently determine whether the HQL/criteria query could be executed against the cache or not, and would only go to the database if necessary. If all the data was in the cache, it would query the cached data itself, kind of like an in-memory database.

However, on first inspection, NHibernate's second level cache mechanism does not seem appropriate for my needs. What I'd like to be able to do is:

  1. Configure it to always have the last 7 days worth of data in the cache. eg. "For this table, cache all records where this field is between 7 days ago and now."
  2. Have the ability to manually maintain the cache. As new data enters the system, it would be nice if I could just throw it straight into the cache rather than waiting until the cache is invalidated. Similarly, as data falls out of the time period, I'd like to be able to pull it from the cache.
  3. Have NHibernate intelligently understand when it can serve a query directly from the cache rather than hitting the database at all. eg. If the user asks for an aggregate of data over the past 3 days, that aggregation should be calculated directly from the cache rather than touching the DB.

Now, I'm pretty sure #3 is asking too much. Even if I can get the cache populated with all the data required, NHibernate has no idea how to efficiently query that data. It would literally have to loop over all entities in order to discriminate which are relevant to the query (which might be fine, to be honest). Also, it would require an implementation of NHibernate's query engine that executed against objects rather than a database. But I can dream, right?

Assuming #3 is asking too much, I would require some logic in my services like this:

public AggregationResults DoIt(DateTime starting, DateTime ending, Filter filter)
{
    if (CanBeServicedFromCache(starting, ending, filter))
    {
        // execute some LINQ to object code or whatever to determine the aggregation results
    }
    else
    {
        // execute HQL/criteria call to determine the aggregation results
    }
}

This isn't ideal because each service must be cache-aware, and must duplicate the aggregation logic: once for querying the database via NHibernate, and once for querying the cache.
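For illustration, here is a minimal sketch of what that duplication might look like. Everything here is hypothetical bookkeeping of my own (the cached window fields, Filter.CanBeAppliedInMemory, filter.Matches), not anything NHibernate provides:

    // Hypothetical bookkeeping - none of this is NHibernate API.
    private DateTime cacheWindowStart;   // start of the fully-cached range (e.g. 7 days ago)
    private List<Entry> cachedEntries;   // all entities within the cached window

    private bool CanBeServicedFromCache(DateTime starting, DateTime ending, Filter filter)
    {
        // Serviceable only if the requested range lies entirely inside the
        // cached window and the filter can be evaluated in memory.
        return starting >= cacheWindowStart
            && ending <= DateTime.Now
            && filter.CanBeAppliedInMemory;
    }

    private AggregationResults AggregateFromCache(DateTime starting, DateTime ending, Filter filter)
    {
        // The same aggregation logic again, this time in LINQ to Objects.
        var perDay = cachedEntries
            .Where(e => e.Timestamp >= starting && e.Timestamp <= ending)
            .Where(e => filter.Matches(e))
            .GroupBy(e => e.Timestamp.Date)
            .Select(g => new { Day = g.Key, Total = g.Sum(e => e.Value) });
        return new AggregationResults(perDay);
    }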

That said, it would be nice if I could at least store the relevant data in NHibernate's second level cache. Doing so would allow other services (that don't do aggregation) to transparently benefit from the cache. It would also ensure that I'm not doubling up on cached entities (once in the second level cache, and once in my own separate cache) if I ever decide the second level cache is required elsewhere in the system.

I suspect if I can get a hold of the implementation of ICache at runtime, all I would need to do is call the Put() method to stick my data into the cache. But this might be treading on dangerous ground...
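For the record, something along these lines appears possible (a rough sketch assuming NHibernate 2.x/3.x, where ISessionFactoryImplementor exposes GetSecondLevelCacheRegion; the region name, key, and value here are placeholders):

    // Rough sketch - dangerous ground, as noted. The second level cache stores
    // NHibernate's internal CacheEntry values keyed by its internal CacheKey
    // type, so hand-rolled Put() calls must replicate formats that NHibernate
    // considers private implementation details.
    var factory = (ISessionFactoryImplementor)sessionFactory;
    ICache cache = factory.GetSecondLevelCacheRegion("MyEntityRegion");
    cache.Put(someKey, someValue);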

Can anyone provide any insight as to whether any of my requirements can be met by NHibernate's second level cache mechanism? Or should I just roll my own solution and forgo NHibernate's second level cache altogether?

Thanks

PS. I've already considered a cube to do the aggregation calculations much more quickly, but that still leaves me with the database as the bottleneck. I may well use a cube in addition to the cache, but the lack of a cache is my primary concern right now.

+1  A: 

Define 2 cache regions "aggregation" and "aggregation.today" with a large expiry time. Use these for your aggregation queries for previous days and today respectively.

In DoIt(), make 1 NH query per day in the requested range using cacheable queries. Combine the query results in C#.
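A sketch of what that per-day loop might look like (the entity and property names are placeholders; SetCacheable and SetCacheRegion are NHibernate's standard query cache hooks):

    // One cacheable query per day; results are combined in memory.
    var dailyTotals = new List<decimal>();
    for (var day = starting.Date; day <= ending.Date; day = day.AddDays(1))
    {
        var region = day == DateTime.Today ? "aggregation.today" : "aggregation";
        var total = session.CreateQuery(
                "select sum(e.Value) from Entry e " +
                "where e.Timestamp >= :start and e.Timestamp < :end")
            .SetDateTime("start", day)
            .SetDateTime("end", day.AddDays(1))
            .SetCacheable(true)
            .SetCacheRegion(region)
            .UniqueResult<decimal?>() ?? 0m;
        dailyTotals.Add(total);
    }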

Prime the cache with a background process that calls DoIt() periodically with the date range you need cached. This process must run more often than the expiry time of the aggregation cache regions, so that entries are refreshed before they expire.

When today's data changes, clear the "aggregation.today" cache region. If you want that region repopulated quickly, either reload it immediately after clearing or run another, more frequent background process that calls DoIt() for today.
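Clearing and re-priming the volatile region is a one-liner each (EvictQueries is the standard ISessionFactory call for dropping a query cache region; Filter.None is a placeholder):

    // When today's data changes, drop only the volatile region...
    sessionFactory.EvictQueries("aggregation.today");
    // ...and optionally re-prime it straight away.
    DoIt(DateTime.Today, DateTime.Today, Filter.None);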

With query caching enabled, NHibernate will pull the results from the cache where possible. Cache hits are keyed on the query and its parameter values.

Lachlan Roche
Have re-read your answer several times to make sure I understand. I think what you're suggesting may at least allow me to leverage the second level cache, but I'd still have to write the aggregation logic twice - once for the cache query, and once for the database query. Right?
Kent Boogaart
My approach caches the aggregate queries directly, by using the exact same query (including cache settings) for cache loading. This lets NH do the actual cache put/get work.
Lachlan Roche
Understood, but that won't work for me because the parameter values - and even the parameters themselves (due to filters) - aren't known until a user makes a request. Still voted you up though, because I think I can at least use the second level cache to store the data, even if I have to manually query it myself.
Kent Boogaart
A: 

When analyzing the NHibernate cache details, I remember reading that you should not rely on the cache being there, which seems like good advice.

Instead of trying to make your O/R mapper cover your application's needs, I think rolling your own data/cache management strategy might be more reasonable.

Also, the 7-day caching rule you describe sounds business-related, which is something the O/R mapper should not know about.

In conclusion: make your app work without any caching at all, then use a profiler (or several - the .NET, SQL, and NHibernate profilers) to see where the bottlenecks are, and start improving the "red" parts by adding caching or other optimizations.

PS: about caching in general - in my experience one caching point is fine, two caches are a gray zone where you should have a strong reason for the separation, and more than two is asking for trouble.

hope it helps

eti
I already have the application working without caching and have already done the performance analysis. Adding a cache will give us 7 times the speed, and increase our scalability greatly since the DB won't be the bottleneck anymore.
Kent Boogaart
+1  A: 

Stop using your transactional (OLTP) data source for analytical (OLAP) queries and the problem goes away.

When a domain-significant event occurs (e.g. a new entity enters the system or is updated), fire an event (a la domain events). Wire up a handler for the event that takes the details of the created or updated entity and stores the data in a denormalised reporting store designed specifically for the aggregates you desire (most likely pushing the data into a star schema). Your reporting then becomes simple querying of aggregates (which may even be precalculated) along predefined axes, requiring nothing more than a select and a few joins. Querying can be carried out using something like L2SQL or even simple parameterised queries and data readers.
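As a rough sketch of the shape of this (the event, handler interface, and reporting store are all placeholder types, not any particular framework):

    // Placeholder types throughout - the shape matters, not the names.
    public interface IHandle<T> { void Handle(T domainEvent); }
    public interface IReportingStore { void IncrementDailyTotal(DateTime day, decimal amount); }

    public class EntryRecorded
    {
        public DateTime Timestamp { get; set; }
        public decimal Value { get; set; }
    }

    public class UpdateDailyAggregates : IHandle<EntryRecorded>
    {
        private readonly IReportingStore store;

        public UpdateDailyAggregates(IReportingStore store) { this.store = store; }

        public void Handle(EntryRecorded e)
        {
            // Fold the new entry into the pre-aggregated row for its day, so
            // reads become a single lookup instead of an on-the-fly aggregation.
            store.IncrementDailyTotal(e.Timestamp.Date, e.Value);
        }
    }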

Performance gains should be significant as you can optimise the read side for fast lookups across many criteria while optimising the write side for fast lookups by id and reduced index load on write.

Additional performance and scalability are also gained: once you have migrated to this approach, you can physically separate your read and write stores, running n read stores for every write store. This allows your solution to scale out to meet increased read demand while write demand grows at a lower rate.

Neal
As stated in my question, moving to OLAP is my secondary concern after caching. A cube will not eliminate the problem of the database being the inhibitor of scalability. It merely has the potential to improve query times, which is not my concern right now (they are actually quite speedy already).
Kent Boogaart
When you read through my response again, you will see that the second paragraph indicates how to separate the read and write stores so that you can scale out, but I'll edit the response to make it clearer. Additionally, separating read and write stores will allow you to create read-specific caches without burdening your write side with additional overhead.
Neal
+1 thanks. I hadn't considered the scalability of it, so good point on that. I'm still pondering whether an OLAP approach could meet all our requirements and whether it's worth it in terms of effort.
Kent Boogaart
You don't need to build a full OLAP-cube-style reporting store to gain benefits. Adding one or more tables that store the data you require in pre-aggregated form, designed specifically to serve the needs of those pages, will greatly simplify your reporting code. Using a domain-events-style approach lets you separate the processing of new data from the creation/updating of the aggregates, and therefore gives you the flexibility to move from relational -> hybrid -> OLAP reporting stores.
Neal