views:

1023

answers:

7

I have a large number of name - value pairs (approx 100k) that I need to store in some sort of cache (say a hash map) where the value is a string with an average of about 30k bytes in size.

Now I know for a fact that a large number of the values have exactly the same string data. In order to avoid having to allocate the identical string data several times, I would like to somehow reuse a previously allocated string thus consuming less memory. In addition this needs to be reasonably fast. i.e. scanning through all the previously allocated values one-by-one is not an option.

Any recommendations on how I could solve this problem?

+8  A: 

String.intern() will help you here (most likely). It will resolve multiple instances of the same string down to one copy.

EDIT: I suggested this would 'most likely' help. In what scenarios will it not ? Interning strings will have the effect of storing those interned string representations permanently. If the problem domain is a one-shot process, this may not be an issue. If it's a long running process (such as a web app) then you may well have a problem.

I would hesitate to say never use interning (I would hesistate to say never do anything). However there are scenarios where it's not ideal.

Brian Agnew
String.intern can be quite slow. It also places the String into the permanent generation, which could well cause GC performance problems.
Tom Hawtin - tackline
The permanent generation is an issue, granted. The question doesn't have the context in which this is to be used. If it's a standalone app, then it may well be ok. Otherwise (say a ong running web app), then not. As ever, solutions need to be evaluated in the context of where they'll be used.
Brian Agnew
@Brian Agnew: My I suggest you edit and expand your answer then to include context? Comments don't count, if you get my drift.
Stu Thompson
+3  A: 

String.intern is the obvious choice as Brian says. But if you don't want to intern across all the String in memory, you can use a Set to first see if the value is present. Here's untested code. You will have to work out removing from reverse map when removing from main

  class Map2<K, V> implements Map<K, V>
  {
    Map<K, V> _map = Maps.newHashMap();
    Set<V, V> _rev = Maps.newHashMap();

    V put(K k, V v) {
      if (_rev.containsKey(v)) {
        V prev = _rev.get(v);
        return _map.put(k, prev);
      } else {
        _rev.put(v, v);
        return _map.put(k,v);
      }
   }
Hemal Pandya
ConcurrentMap has putIfAbsent, which might be useful.
Tom Hawtin - tackline
I like this solution, it's no overkill with weak references etc.To optimize even more on storage, one could just search the existing values in the Map, given that the total number is small (say <10000).Upvote!
Ingo
+4  A: 

Do not use String.intern (there have been various memory issues related to this through the years). instead, create your own cache, similar to String.intern. basically, you want a Map, where each key maps to itself. then, before caching any string, you "intern" it:

private Map<String,WeakReference<String>> myInternMap = new WeakHashMap<String,,WeakReference<String>>();
public String intern(String value) {
  synchronized(myInternMap) {
    WeakReference<String> curRef = myInternMap.get(value);
    String curValue = ((curRef != null) ? curRef.get() : null);
    if(curValue != null) {
      return curValue;
    }

    myInternMap.put(value, new WeakReference<String>(value));
    return value;
  }
}

note, you use weakreferences for the keys and values so that you don't keep references for strings which you are no longer using.

james
james? as in JT?
kdgregory
yep, tis JT. too funny that i wrote your code for you.
james
No, this is very BAD advice. Most such comments refer to rather old issues for now obsolete JVMs.There is absolutely nothing wrong with String.intern() for long-living shared Strings. Much less than issues with roll-your-own replacements.
StaxMan
+1  A: 

It somewhat depends upon how you are creating the String.

One possible way is to use TreeSet that uses a Comparator that can compare existing Strings and the source of your new String. Use SortedSet.tailSet and an Iterator to find an existing String. Or alternatively NavigableSet.ceiling/floor or a TreeMap with a similar setup.

I wrote a weblog entry about another technique to cache immutable objects (in particular Strings), but that is more suitable for smaller objects.

String.intern has performance problems.

Tom Hawtin - tackline
+1  A: 

Agreed with others on not using String.intern(): once you've put a string there, it will never go away. Look to early revisions of Xerces for why this is a bad idea.

A better solution is to use a WeakHashMap, wrapping the value in a WeakReference:

private Map<String,WeakReference<String>> _map 
    = new WeakHashMap<String,WeakReference<String>>();

public synchronized String intern(String str)
{
    WeakReference<String> ref = _map.get(str);
    String s2 = (ref != null) ? ref.get() : null;
    if (s2 != null)
        return s2;
    str = new String(str);
    _map.put(str, new WeakReference(str));
    return str;
}

This code is from an article that I wrote on the Java reference objects. You'll find the explanation there.

EDIT: need to create a new string here (and I'll update the article) because the original might be a substring of a far larger character array. I thought that was fixed around JDK 1.3, but apparently not (at least not in 1.5).

kdgregory
Interning a string will not mean that it will 'never go away', you can garbage collect the perm gen though it may not be so efficient it can and will get garbage collected if there are no strong references to it.
PintSizedCat
The permgen, at least in the Sun JVM, is managed separately from the rest of the heap. If you can point to code that removes strings from the intern table, then I'm willing to retract my statement.
kdgregory
A: 

You could compress the strings. A 30K string should get a good compression ratio. I wrote a hack to compress large String as an exercise, but you could use a byte[] of the compressed data to store the String.

A 30K character string will use about 60KB (2 bytes per character) so even using getBytes() is likely to be an improvement.

Peter Lawrey
A: 

Do you actually need Strings, or do you just need any old CharSequence? If not, then consider implementing a "compact" CharSequence such as the one I suggest in the link.

Neil Coffey