I often need relatively small caches (<10,000 entries, <1 kB) to speed up calculations. My usual code looks like this:

cache = {}
def calculate_caches(parms):
    if parms not in cache:
        cache[parms] = calculate(parms)
    return cache[parms]

This works fine, but for longer-running processes I'm afraid the cache will grow without bound. So I often implement brute-force memory clamping like this:

if len(cache) > 1000:
    cache.clear()  # clear in place; reassigning cache inside calculate_caches would need a global declaration

This works reasonably well in most cases and is still clean, simple code. But if I want a real LRU strategy, I need to store a timestamp together with each cache entry. The problem with using a dict for this is that expiring the cache then means traversing the whole cache, which is neither elegant nor efficient.

import time

cache = {}

def calculate_caches(parms):
    if parms not in cache:
        # Store the insertion timestamp alongside the computed value.
        cache[parms] = (time.time(), calculate(parms))
    expire()
    return cache[parms][1]

def expire():
    if len(cache) > 1000:
        # Finding the oldest entry requires traversing the whole cache.
        mintime = time.time()
        time2key = {}
        for key, (timestamp, val) in cache.items():
            mintime = min(mintime, timestamp)
            time2key[timestamp] = key
        if mintime in time2key:
            del cache[time2key[mintime]]

Are there preferable approaches or data structures for implementing ad-hoc caching?

My problem is quite similar to this question, but I don't need the list to be sorted by time and I don't want duplicates.

+1  A: 

A very simple way to do this without timestamps is to keep an ordered dictionary with the MRU entry at the end: when a request for the same object comes in a second time, delete it and add it back at the end of the dict. When you need to expire entries, just remove a slice of size X from the beginning of the ordered dict whenever its size exceeds the limit.

Efficiency would now depend on how that ordered dict is implemented.
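
In Python, collections.OrderedDict can serve as that ordered dict. Here is a minimal sketch of the idea, reusing calculate() and the 1000-entry limit from the question; everything else is illustrative:

from collections import OrderedDict

MAXSIZE = 1000  # assumed limit, mirroring the question's clamp
cache = OrderedDict()

def calculate_caches(parms):
    if parms in cache:
        # Re-insert on a hit so the least recently used entries stay at the front.
        cache[parms] = cache.pop(parms)
    else:
        cache[parms] = calculate(parms)
        if len(cache) > MAXSIZE:
            # Evict the oldest (front) entry; last=False pops from the front.
            cache.popitem(last=False)
    return cache[parms]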

Vinko Vrsalovic
A: 

I doubt there's a silver bullet for this; the optimal strategy depends strongly on the cost of cache misses and on the temporal distribution of the parameters to your calculation.

Garbage collection methods might give you some inspiration. If you think of your cache as the heap and of cache hits as references, then you have the problem of efficiently collecting cached results with a low (but non-zero) number of hits. The problem is far more forgiving than GC, because anything you evict can simply be recalculated.

A refinement of your method in this vein would be to introduce an additional cache for frequently hit parameters. Add a counter to each cached value that is incremented on every cache hit; when it passes some threshold, the value is promoted to the additional cache. Both generations of cache can be size-clamped, so you would still have a hard limit on memory use. Whether the (possible) reduction in cache misses justifies the overhead (lookups in two caches, hit counters, copying, etc.) is an empirical question...
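
A rough sketch of that two-generation idea, reusing calculate() from the question; the thresholds, names, and promotion rule below are illustrative assumptions rather than anything prescribed here:

PROMOTE_AFTER = 5     # assumed hit-count threshold for promotion
YOUNG_LIMIT = 1000    # assumed size clamp for the first generation
HOT_LIMIT = 200       # assumed size clamp for the promoted generation

young = {}  # parms -> [hit_count, value]
hot = {}    # parms -> value (frequently hit entries)

def calculate_cached(parms):
    if parms in hot:
        return hot[parms]
    if parms in young:
        entry = young[parms]
        entry[0] += 1
        if entry[0] >= PROMOTE_AFTER and len(hot) < HOT_LIMIT:
            # Promote a frequently hit value to the second generation.
            hot[parms] = entry[1]
            del young[parms]
            return hot[parms]
        return entry[1]
    # Cache miss: compute, then store in the first generation,
    # clamping its size with the question's brute-force approach.
    value = calculate(parms)
    if len(young) >= YOUNG_LIMIT:
        young.clear()
    young[parms] = [0, value]
    return value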

othercriteria