Problem setting

  1. Entities arrive for processing and are taken through a series of steps which operate on those entities, and possibly on other related entities, and generate results;
  2. Some of the entities are required to be processed in real-time, without any database access;
  3. Currently the implementation simply looks up entities in the database, without any caching.

Optimisation time :-)

Possible approaches

Simple cache

A simple in-memory cache has 2 flaws:

  1. it may overflow, since we are talking about a large number of entities;
  2. it does not guarantee that the required entities are found in the cache, and it has no way of being queried about their availability or asked to "preload" itself.

So this is a no-go.

Entity analysis + preloading

I'm considering building some sort of analyser to find out which data needs to be retrieved for a given entity, even in large forms, and issue a request for the caches to load the required data out-of-band.

The steps would be:

  1. Entity arrives. If it's required to be processed in-memory, send a cache load request;
  2. Entity is placed in a cache waiting queue until the cache loaded response is received. This may be immediate if the data is available;
  3. Entity is sent for processing and makes use of the loaded data;
  4. Caches are cleared. This leaves room for smarter clearing policies, but I'm not concerned about those at the moment.
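The four steps above could be sketched roughly like this (all class and method names are illustrative, and the "database fetch" is simulated; a real implementation would wire this to actual entities and storage):

```java
import java.util.ArrayDeque;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of the preload flow: an entity that must be processed
// in-memory triggers an out-of-band cache load, waits in a queue until the
// data is available, is processed, and then its cache entry is cleared.
public class PreloadPipeline {
    private final ExecutorService loader = Executors.newSingleThreadExecutor();
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Queue<String> waiting = new ArrayDeque<>();

    // Step 1: entity arrives; send the cache load request asynchronously.
    Future<?> onArrival(String entityId) {
        waiting.add(entityId); // Step 2: park the entity until data is ready.
        return loader.submit(() ->
            cache.put(entityId, "data-for-" + entityId)); // simulated DB fetch
    }

    // Step 3: process only entities whose data is already in the cache.
    String processNext() {
        String id = waiting.peek();
        if (id == null || !cache.containsKey(id)) return null; // not ready yet
        waiting.poll();
        String result = "processed " + cache.get(id);
        cache.remove(id); // Step 4: clear the cache entry after use.
        return result;
    }

    public static void main(String[] args) throws Exception {
        PreloadPipeline p = new PreloadPipeline();
        p.onArrival("42").get(); // block until the preload completes
        System.out.println(p.processNext());
        p.loader.shutdown();
    }
}
```

Note that the main processing loop stays single-threaded; only the cache load happens on a separate executor.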

Questions

What are your opinions about this approach? Am I missing some well-known data access patterns which can be applied in this case?


Update 1: Forgot to mention that the whole processing is single-threaded, which restricts the options considerably.

+2  A: 

Basically you're trying to cache database queries. By the time you get around to using the cache, the database state might have changed. That's a recipe for data inconsistencies.

As an alternative, check if you can optimize the database. It's very possible to have a database answer queries in < 10 milliseconds. You can even have an indexed view or the like, and access it regularly so it's cached in memory.

As another alternative, consider this: the total amount of work does not decrease by pre-fetching the data. The entity has to wait for the pre-fetch, whether it's queued or not. Since the work has to be done anyway, you might as well do it in the queue worker process. Consider increasing the number of worker processes, so you can process more queues simultaneously.

EDIT: As your comment says you're bound to a single worker thread:

  • Maybe split the processing in two steps? The first process retrieves the database data and stores the enriched entity in a new queue. The second process reads from the new queue and does the work involving the other in-memory data sources.
  • Protect the other in-memory entities with a global mutex. This means many worker threads can talk to the database, while only one can access the other in-memory entities.
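The two-step split could look something like this (a minimal sketch with invented names; a real pipeline would replace the string concatenation with an actual database lookup and could run several enricher threads):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of the two-stage split: an enricher thread attaches database data
// to each entity and hands it to a queue; the single processing thread then
// works purely from memory and never touches the database itself.
public class TwoStagePipeline {
    public static void main(String[] args) throws Exception {
        BlockingQueue<String> incoming = new LinkedBlockingQueue<>();
        BlockingQueue<String> enriched = new LinkedBlockingQueue<>();

        // Stage 1: only talks to the database, never to the shared
        // in-memory state, so it could safely be multiplied into a pool.
        Thread enricher = new Thread(() -> {
            try {
                String entity = incoming.take();
                enriched.put(entity + "+dbData"); // simulated DB lookup
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        enricher.start();

        incoming.put("entity-1");

        // Stage 2: the single worker thread; consumed inline here.
        String ready = enriched.take();
        System.out.println("processed " + ready);
        enricher.join();
    }
}
```

The queue between the stages is what keeps the in-memory work single-threaded while still letting the database I/O happen elsewhere.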
Andomar
Thanks for the answer. I forgot to mention that all the processing is done on a single thread, since we are accessing other in-memory data sources.
Robert Munteanu
After the edit: this is mostly what I had in mind, loading the data on a different thread while keeping the main processing separate. Thanks.
Robert Munteanu
+3  A: 

You said:

A simple in-memory cache has 2 flaws:

  1. it may overflow, since we are talking about a large number of entities
  2. it does not guarantee that the required entities are found in the cache, and it has no way of being queried about the availability or being asked to "preload" itself.

Perhaps I am completely misunderstanding your question and needs, but this sounds incorrect on a number of levels:

  1. Many caching solutions allow you to define a maximum number of elements that you can store in the cache. Once the maximum size is hit, items can be removed on a first-in-first-out policy or based on least-recently-used.
  2. A cache is not supposed to "guarantee that the required entities are found in the cache"; this is not the purpose of a cache.
  3. The API to most caching solutions does allow you to check if a key is present in the cache (in fact, if you built your own solution using a Map you could still do this...).
  4. Ehcache has self-populating caches, which can be used to pre-populate a cache before you need to start retrieving items.
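To illustrate points 1 and 3 with nothing but the standard library (dedicated solutions such as Ehcache expose the same ideas through configuration), here is a size-bounded, least-recently-used cache built on `LinkedHashMap`:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A minimal size-bounded LRU cache: once maxEntries is exceeded, the
// least-recently-accessed entry is evicted automatically.
public class BoundedLruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    BoundedLruCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder=true gives LRU ordering
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries; // evict once the bound is exceeded
    }

    public static void main(String[] args) {
        BoundedLruCache<String, Integer> cache = new BoundedLruCache<>(2);
        cache.put("a", 1);
        cache.put("b", 2);
        cache.get("a");    // touch "a" so "b" becomes the eldest entry
        cache.put("c", 3); // overflow: "b" is evicted, not "a"
        System.out.println(cache.containsKey("b")); // key-presence check
        System.out.println(cache.keySet());
    }
}
```

`containsKey` is the availability query from point 3; `removeEldestEntry` is the overflow policy from point 1.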
matt b
Thanks for the answer. 1 and 2 are the root cause of my problem. I'm calling it a 'cache' for lack of a better name. It's rather an in-memory data layer which is offloaded to disk-based storage to limit memory use. The trouble is that I need a guarantee, otherwise a stock cache would've done perfectly. And 4) looks nice, thanks.
Robert Munteanu
Ah, I see then. I think use of the word "cache" to mean multiple things is definitely a source of confusion.
matt b