views: 498, answers: 4
I want a collection in Java which:

  • maps arbitrary Objects to Objects (not String or otherwise restricted keys only)
  • will be used as a cache; if the key is not in the cache, a value will be computed (this doesn't have to be built into the collection)
  • will be accessed from multiple threads simultaneously
  • will never have items removed from it
  • must be very efficient to read (cache hit); not necessarily efficient to write (cache miss)

It's OK if cache misses simultaneously in multiple threads cause redundant computations; the typical case is that the cache is mostly filled by one thread at first.

A synchronized block around a thread-unsafe hashtable fails the efficient-to-read criterion. Thread-local caches would be straightforward but mean that new threads are expensive since they have full copies of the cache.

Java 1.5 built-ins, or one or a few class files we can copy into our MIT-licensed project, are preferred over large external libraries.

+2  A: 

This is my own idea for a solution, but I'm not an expert on threaded programming, so please comment/vote/compare to other answers as you feel appropriate.

Use a thread-local variable (java.lang.ThreadLocal) containing a per-thread hashtable that serves as a first-level cache. If the key is not found there, fall through to a second-level cache: a synchronized hashtable shared by all threads. This way the cache value is only ever computed once and is shared among all threads, while each thread keeps a local copy of the key-to-value mapping. There is some memory cost (though less than fully independent per-thread caches), but reads are efficient.
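A minimal sketch of the idea (the class and method names are placeholders; the per-key computation is supplied by an abstract computeValue method):

import java.util.HashMap;
import java.util.Map;

public abstract class TwoLevelCache<K, V> {

    // Second-level cache, shared by all threads; guarded by its own lock.
    private final Map<K, V> shared = new HashMap<K, V>();

    // First-level cache: one private, unsynchronized map per thread.
    private final ThreadLocal<Map<K, V>> local = new ThreadLocal<Map<K, V>>() {
        protected Map<K, V> initialValue() {
            return new HashMap<K, V>();
        }
    };

    public V get(K key) {
        Map<K, V> mine = local.get();
        V value = mine.get(key);
        if (value == null) {
            synchronized (shared) {
                value = shared.get(key);
                if (value == null) {
                    value = computeValue(key);  // computed at most once per key
                    shared.put(key, value);
                }
            }
            mine.put(key, value);  // remember it locally for lock-free reads later
        }
        return value;
    }

    protected abstract V computeValue(K key);
}

New threads start with an empty first-level map and repopulate it lazily from the shared map, so each thread pays for the lock at most once per key.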

Kevin Reid
It's an OK solution, but an implementation using the Java ConcurrentHashMap (see my answer) is just as simple to implement, will perform much better, and will consume significantly less memory.
Martin
+1  A: 

How about the cache described in section 5.6 of Java Concurrency in Practice, by Brian Goetz? It's outlined here.

It only uses classes from the java.util.concurrent package.

The linked article builds up a cache step by step and describes the weaknesses of each version, until the final version: an efficient cache in which only one thread will compute a given absent entry.

I've cut and pasted the final code below, but it's worth reading through the article and thinking about the issues outlined. Or even better, buy the book - it's great.

import java.util.concurrent.*;

// The Computable interface used by Memoizer (defined alongside it in the book):
interface Computable<A, V> {
    V compute(A arg) throws InterruptedException;
}

public class Memoizer<A, V> implements Computable<A, V> {

    private final ConcurrentMap<A, Future<V>> cache
            = new ConcurrentHashMap<A, Future<V>>();
    private final Computable<A, V> c;

    public Memoizer(Computable<A, V> c) { this.c = c; }

    public V compute(final A arg) throws InterruptedException {
        while (true) {
            Future<V> f = cache.get(arg);
            if (f == null) {
                Callable<V> eval = new Callable<V>() {
                    public V call() throws InterruptedException {
                        return c.compute(arg);
                    }
                };
                FutureTask<V> ft = new FutureTask<V>(eval);
                f = cache.putIfAbsent(arg, ft);
                if (f == null) {
                    // We won the race: run the computation on this thread.
                    f = ft;
                    ft.run();
                }
            }
            try {
                return f.get();
            } catch (CancellationException e) {
                cache.remove(arg, f);
            } catch (ExecutionException e) {
                // Kabutz: this is my addition to the code...
                try {
                    throw e.getCause();
                } catch (RuntimeException ex) {
                    throw ex;
                } catch (Error ex) {
                    throw ex;
                } catch (Throwable t) {
                    throw new IllegalStateException("Not unchecked", t);
                }
            }
        }
    }
}
A_M
We don't need to ensure that values are only computed once, so it sounds like we could just use a ConcurrentHashMap, and not worry about redundant computation from simultaneous cache misses. Any reason not to do that?
Kevin Reid
My solution above is based on the assumption that you don't need to worry about calculating a value twice :)
Martin
Yup - understood. The linked article goes through that case too. Also, the question didn't state that the cache must explicitly allow concurrent computations, so I assumed a solution that prevented them was also acceptable. I guess I shouldn't have posted the code. Perhaps some thought should be given to who will be using the cache: concurrent code is notoriously difficult to understand, and a cache that could compute entries twice might end up being used by an inexperienced programmer in a situation where you definitely don't want to compute an entry twice.
A_M
+7  A: 

Use the Java ConcurrentHashMap.

ConcurrentHashMap<Object, Object> table = new ConcurrentHashMap<Object, Object>();

public Object getFromCache(Object key)
{
    Object value = table.get(key);

    if (value == null)
    {
        // key isn't a key into this table, i.e. it's not in the cache
        value = CalculateValueForKey(key);
        // Values are immutable, so the result of putIfAbsent can be ignored:
        // it doesn't matter whose copy ends up in the table.
        Object fromCache = table.putIfAbsent(key, value);
    }

    return value;
}

/**
 * This calculates a new value to put into the cache.
 */
public abstract Object CalculateValueForKey(Object key);

N.b. this is no longer a general solution for multithreaded caching, since it relies on the stated fact that the cached objects are immutable, and thus object identity is not important.

Martin
After editing 3 times, it's finally correct ;) Depending on how your CalculateValueForKey method is implemented, this should be pretty fast even on writes (and will never block).
Martin
Fixed a small bug using putIfAbsent; it now uses recursion. If you're a bit scared of stack overflows, you can trivially turn the value == null bit into a loop :)
Martin
Why not do something like: Object toCache = CalculateValueForKey(foo); Object atomicPut = table.putIfAbsent(key, toCache); return atomicPut == null ? toCache : atomicPut; No need for recursion there.
John V.
that will calculate a new value every time, and since it was specified that reads should be efficient, I tried to avoid that. I shall revise the answer to have a loop (it's really a trivially obvious modification though...)
Martin
I am only referring to the block of code where you do the calculation, that is, the if (value == null) branch. If you assign the result of putIfAbsent and it is not null, then that is the value from table.get(key) - one less hash lookup.
John V.
Ahh, nice idea, added it in. I'm not sure it's really worth saving a single lookup during a read operation, but I did it anyway ;)
Martin
Yea, it makes it look slightly more complex. I am only suggesting it based on this article: http://dmy999.com/article/34/correct-use-of-concurrenthashmap
John V.
This looks good, but: (1) I don't think I need the fromCache at all, since I don't care if the cache gives different copies of the answer (the cache values are immutable and object identity is irrelevant). (2) Why do you use a while loop around `(value == null)`? It seems to me that if the value is initially null, then the value assigned must be non-null assuming `toCache` is non-null, so the loop will execute at most once and can be an `if`. Am I missing something?
Kevin Reid
Yes and no. In the previous revision of this answer the while loop was needed; the micro-optimisation John suggested removes the need for it. However, in very rare circumstances it might return an object which is no longer in the cache. This, of course, isn't a problem. I shall revise the code again to reflect your comments (especially about immutability).
Martin
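(For reference, the putIfAbsent pattern discussed above - use its return value to decide which copy to keep - looks roughly like this, reusing the names from the answer; a sketch only:)

Object value = table.get(key);
if (value == null) {
    Object computed = CalculateValueForKey(key);
    Object previous = table.putIfAbsent(key, computed);
    // putIfAbsent returns the existing mapping, or null if our value was stored
    value = (previous == null) ? computed : previous;
}
return value;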
@Kevin Reid: Don't forget to remove items from your cache when you no longer need them, or you might have a memory leak. Google Collections has a MapMaker that lets you easily create a ConcurrentHashMap with weak keys, so you might want to take advantage of that...
Enno Shioji
Not a significant problem; the keys in my actual application are Java classes, so other than The Stuff You Can Do With ClassLoaders, there isn't much of a problem. But yes, weak keys would help, as provided by http://google-collections.googlecode.com/svn/trunk/javadoc/com/google/common/collect/MapMaker.html
Kevin Reid
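(As an aside, a minimal sketch of the MapMaker approach mentioned in the comments above, assuming the Google Collections API linked there; the surrounding class and the computed value are made up for illustration:)

import java.util.concurrent.ConcurrentMap;
import com.google.common.base.Function;
import com.google.common.collect.MapMaker;

public class WeakClassKeyCache {

    // weakKeys() compares keys by identity and lets otherwise-unreferenced
    // Class keys be garbage collected along with their ClassLoader.
    private final ConcurrentMap<Class<?>, String> cache = new MapMaker()
            .weakKeys()
            .makeComputingMap(new Function<Class<?>, String>() {
                public String apply(Class<?> key) {
                    return describe(key);  // hypothetical computation
                }
            });

    public String get(Class<?> key) {
        return cache.get(key);  // computes and caches the value on a miss
    }

    private String describe(Class<?> key) {
        return key.getName();
    }
}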
+1  A: 

How about something like this SingletonCache class from one of my projects?

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public abstract class SingletonCache<K, V> {

    private final ConcurrentMap<K, V> cache = new ConcurrentHashMap<K, V>();

    public V get(K key) {
        V value = cache.get(key);
        if (value == null) {
            // Another thread may insert a value first; putIfAbsent keeps the winner.
            cache.putIfAbsent(key, newInstance(key));
            value = cache.get(key);
        }
        return value;
    }

    protected abstract V newInstance(K key);
}

To use it, you extend it and implement the newInstance method, which creates a new value in case of a cache miss. Then call the get method with a key to get the instance which corresponds to that key. Here is an example of how it is used.

That class guarantees that for each key only one instance is ever returned, but the newInstance method may be called multiple times; in that case the instance that reaches the map first is used and the rest are discarded. Also note that this cache does not remove old instances but stores all values indefinitely (in my use case there is a limited number of instances that must be cached). Reading from a ConcurrentHashMap does not use locking, so it should satisfy your efficiency requirement.
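(For illustration, a hypothetical subclass - the key type and the computation are made up:)

public class ClassNameCache extends SingletonCache<Class<?>, String> {

    protected String newInstance(Class<?> key) {
        // May run more than once for the same key under contention;
        // only one result ends up in the cache.
        return key.getName();
    }
}

Calling new ClassNameCache().get(String.class) computes the value on the first call and returns the same cached instance on later calls.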

Esko Luontola