views:

268

answers:

4

My application is multithreaded with intensive String processing. We are experiencing excessive memory consumption and profiling has demonstrated that this is due to String data. I think that memory consumption would benefit greatly from using some kind of flyweight pattern implementation or even cache (I know for sure that Strings are often duplicated, although I don't have any hard data in that regard).

I have looked at Java Constant Pool and String.intern, but it seems that it can provoke some PermGen problems.

What would be the best alternative for implementing application-wide, multithreaded pool of Strings in java?

EDIT: Also see my previous, related question: http://stackoverflow.com/questions/2909848

+2  A: 

This is already done at the JVM level. You only need to ensure that you aren't creating new Strings everytime, either explicitly or implicitly.

I.e. don't do:

String s1 = new String("foo");
String s2 = new String("foo");

This would create two instances in the heap. Rather do so:

String s1 = "foo";
String s2 = "foo";

This will create one instance in the heap and both will refer the same (as evidence, s1 == s2 will return true here).

Also don't use += to concatenate strings (in a loop):

String s = "";
for (/* some loop condition */) {
    s += "new";
}

The += implicitly creates a new String in the heap everytime. Rather do so

StringBuilder sb = new StringBuilder();
for (/* some loop condition */) {
    sb.append("new");
}
String s = sb.toString();

If you can, rather use StringBuilder or its synchronized brother StringBuffer instead of String for "intensive String processing". It offers useful methods for exactly those purposes, such as append(), insert(), delete(), etc. Also see its javadoc.

BalusC
I know. I am wondering if that is the best way to do it? Interning uses PermGen memory and I am told that if it is exhausted, it can be difficult to debug...
Dan
Well, if your problem can't be solved with writing code more efficiently, then you have 2 options: 1) Don't store large data in JVM's memory, but on disk file system or a database and only query it whenever really needed and then garbage it after use. 2) Give the JVM more memory.
BalusC
@BalusC I think problem can be greatly reduced by writing more efficient code, but there are more than one alternative: 1. I can rely on Java Constant Pool (as you suggest) 2. I can roll-your-own flyweight, for starters. I am trying to understand and compare alternatives.
Dan
I don't see how a homegrown flyweight String can solve this. It would only add an extra abstraction layer (and extra overhead). It's already solved with the String pool in JVM. Rather review the code and use `StringBuilder` where applicable.
BalusC
A: 

First, decide how much your application and developers would suffer if you eliminated some of that parsing. A faster application does you no good if you double your employee turnover rate in the process! I think based on your question we can assume you passed this test already.

Second, if you can't eliminate creating an object, then your next goal should be to ensure it doesn't survive Eden collection. And parse-lookup can solve that problem. However, a cache "implemented properly" (I disagree with that basic premise, but I won't bore you with the attendant rant) usually brings thread contention. You'd be replacing one kind of memory pressure for another.

There's a variation of the parse-lookup idiom that suffers less from the sort of collateral damage you usually get from full-on caching, and that's a simple precalculated lookup table (see also "memoization"). The Pattern you usually see for this is the Type Safe Enumeration (TSE). With the TSE, you parse the String, pass it to the TSE to retrieve the associated enumerated type, and then you throw the String away.

Is the text you're processing free-form, or does the input have to follow a rigid specification? If a lot of your text renders down to a fixed set of possible values, then a TSE could help you here, and serves a greater master: Adding context/semantics to your information at the point of creation, instead of at the point of use.

Jason
+4  A: 

I know it goes against what people often tell you, but sometimes explicitly creating new String instances can be a significant way to reduce your memory.

Because Strings are immutable, several methods leverage that fact and share the backing character array to save memory. However, occasionally this can actually increase the memory by preventing garbage collection of unused parts of those arrays.

For example, assume you were parsing the message IDs of a log file to extract warning IDs. Your code would look something like this:

//Format:
//ID: [WARNING|ERROR|DEBUG] Message...
String testLine = "5AB729: WARNING Some really really really long message";

Matcher matcher = Pattern.compile("([A-Z0-9]*): WARNING.*").matcher(testLine);
if ( matcher.matches() ) {
    String id = matcher.group(1);
        //...do something with id...
}

But look at the data actually being stored:

    //...
    String id = matcher.group(1);
    Field valueField = String.class.getDeclaredField("value");
    valueField.setAccessible(true);

    char[] data = ((char[])valueField.get(id));
    System.out.println("Actual data stored for string \"" + id + "\": " + Arrays.toString(data) );

It's the whole test line, because the matcher just wraps a new String instance around the same character data. Compare the results when you replace String id = matcher.group(1); with String id = new String(matcher.group(1));.

Mark Peters
Heed this, if you're doing lots of string manipulation on large strings. String s = hugeString.substring(0,1); hugeString = null; s is not "1 character", it's as large as hugeString was (since it IS hugeString).
Will Hartung
Will, I don't think I follow, substring will create new String instance whose size is 1.
Dan
@Dan: the "size" (as returned by `length()`) is 1. But both Strings reference the same underlying character array, simply recording different start and end indexes of the array that the string refers to. The character array is not copied to the new String; only a reference to the array is.
Mark Peters
@Mark, thanks :)
Dan
A: 

Effeciently pack Strings in memory! I once wrote a hyper memory efficient Set class, where Strings were stored as a tree. If a leaf was reached by traversing the letters, the entry was contained in the set. Fast to work with, too, and ideal to store a large dictionary.

And don't forget that Strings are often the largest part in memory in nearly every app I profiled, so don't care for them if you need them.

Illustration:

You have 3 Strings: Beer, Beans and Blood. You can create a tree structure like this:

B
+-e
  +-er
  +-ans
+-lood

Very efficient for e.g. a list of street names, this is obviously most reasonable with a fixed dictionary, because insert cannot be done efficiently. In fact the structure should be created once, then serialized and afterwards just loaded.

Daniel
sounds interesting, but I am not sure I quite follow. Would you mind illustrating it?
Dan
As I understand your description of it, this is known as a Patricia trie: http://en.wikipedia.org/wiki/Patricia_trie
Mark Peters
+1 Thanks for the name! Now I know how to look for better implementations than mine :).
Daniel