I understand the basic idea of Java's String interning, but I'm trying to figure out in which situations it happens automatically, and in which I would need to do my own flyweighting.

Somewhat related:

Together they tell me that String s = "foo" is good and String s = new String("foo") is bad but there's no mention of any other situations.

In particular, if I parse a file (say a CSV) that has a lot of repeated values, will Java's string interning cover me or do I need to do something myself? I've gotten conflicting advice about whether or not String interning applies here in my other question.


The full answer came in several fragments, so I'll sum up here:

By default, Java only interns strings that are known at compile time. String.intern() can be used at runtime, but it doesn't perform very well, so it's only appropriate for smaller numbers of Strings that you're sure will be repeated a lot. For larger sets of Strings it's Guava to the rescue (see ColinD's answer).

+1  A: 

In most cases, a string is created from a byte or char array (unless it's a string literal in the code), so you can test it.

    String s = "test";
    String s1 = new String(s.getBytes());
    String s2 = String.valueOf(s.toCharArray());
    String s3 = new String(s.toCharArray());

    System.out.println(s == s1);
    System.out.println(s == s2);
    System.out.println(s == s3);

Prints false for all three. But you can explicitly intern a string if you think you'll have a lot of repeating values. If you add this before the println calls in the above example, it'll print true for all three comparisons:

    s1 = s1.intern();
    s2 = s2.intern();
    s3 = s3.intern();

See String#intern description in the API.

edit
So would using intern() on each value that's read in be a reasonable way to achieve flyweighting?
Yes, assuming no references are held to the old string. If the old string isn't referenced anywhere anymore, it'll be garbage-collected.

Nikita Rybak
So would using `intern()` on each value that's read in be a reasonable way to achieve flyweighting?
bemace
+1  A: 

From the String javadoc:

All literal strings and string-valued constant expressions are interned.

That leads me to believe that strings you get from a file, after your program has been compiled, won't be interned automatically.

If you said something like,

    String x = "string";

that would be interned by the compiler because it's visible at compile time.
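As a minimal sketch of what "visible at compile time" covers (the class and method names here are made up for illustration), the following compares a constant expression, a concatenation with a `static final` constant, and a concatenation whose operand is only known at run time:

```java
class CompileTimeInterning {
    // A constant variable: final and initialized with a constant expression.
    static final String PREFIX = "str";

    static boolean constantExpressionIsInterned() {
        // "str" + "ing" is a constant expression, so it is the same object as the literal.
        return ("str" + "ing") == "string";
    }

    static boolean finalConstantIsInterned() {
        // PREFIX + "ing" is still a constant expression because PREFIX is a constant variable.
        return (PREFIX + "ing") == "string";
    }

    static boolean runtimeConcatenationIsInterned(String prefix) {
        // prefix is only known at run time, so the concatenation creates a new String object.
        return (prefix + "ing") == "string";
    }

    static boolean explicitInternMatchesLiteral(String prefix) {
        // intern() returns the pooled instance, which is the literal's object.
        return (prefix + "ing").intern() == "string";
    }

    public static void main(String[] args) {
        String prefix = new String("str"); // stands in for a value read from a file
        System.out.println(constantExpressionIsInterned());           // true
        System.out.println(finalConstantIsInterned());                // true
        System.out.println(runtimeConcatenationIsInterned(prefix));   // false
        System.out.println(explicitInternMatchesLiteral(prefix));     // true
    }
}
```

The `==` comparisons are deliberate here: they test object identity, which is exactly what interning affects.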

If you know that certain strings are very common in your input file you can call

    stringFromFile = stringFromFile.intern();

and that particular string will be added to the intern pool for later use (keep the reference that intern() returns, since that is the pooled instance). You could even pre-cache them by putting calls to intern in the main or static portion of your code.

You could try an experiment on your particular input and see what would happen in the best case if you manually intern some data and compare that to the default no-intern behavior.

Paul Rubel
+1  A: 

As far as I am aware, string interning happens automatically for String literals only; all others have to be programmatically interned using the String#intern() method. Thus constructing a String via its constructor from an already interned String literal produces a new String which isn't interned, but which has the same content as the interned literal it was constructed from.

I found a good basic overview of interning (might be a bit basic, but still explains it just fine) on javatechniques.com.

micdah
+3  A: 

Don't use String.intern() in your code. At least not if you might get 20 or more different strings. In my experience, using String.intern() slows down the whole application when you have a few million strings.

To avoid duplicated String objects, just use a HashMap.

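    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    private final Map<String, String> pool = new HashMap<String, String>();

    // Returns the pooled instance equal to s, adding s to the pool if it is new.
    private String interned(String s) {
      String interned = pool.get(s);
      if (interned != null) {
        return interned;
      }
      pool.put(s, s);
      return s;
    }

    // CsvFile is assumed to be iterable over its rows, each a List<String>.
    private void readFile(CsvFile csvFile) {
      for (List<String> row : csvFile) {
        for (int i = 0; i < row.size(); i++) {
          row.set(i, interned(row.get(i)));
          // further process the row
        }
      }
      pool.clear(); // allow the garbage collector to clean up
    }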

With that code you can avoid duplicate strings for one CSV file. If you need to avoid them on a larger scale, call pool.clear() in another place.

Roland Illig
Why a map and not a set? HashSet seems a better choice to me.
Chris Knight
If you use a set, how do you get the interned version back out? i.e., what would you replace `pool.get()` with?
andersoj
@Roland Illig: I use an approach like this with `WeakReference<String>` to implement a cache for a long-running application that processes a lot of identical strings from network messages.
andersoj
Note that using a `HashMap` for this isn't thread safe (though for this example, thread safety obviously isn't needed). A `ConcurrentMap` and `putIfAbsent` should be used if that's needed. Guava does this for you in its `Interner` implementations.
ColinD
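A rough sketch of the `ConcurrentMap` and `putIfAbsent` variant mentioned in the comment above, assuming a pool shared between threads. The class name `ConcurrentStringPool` is made up for illustration, and this is not how Guava implements its interners:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class ConcurrentStringPool {
    private final ConcurrentMap<String, String> pool = new ConcurrentHashMap<>();

    /** Returns the canonical instance equal to s, registering s if none exists yet. */
    String intern(String s) {
        // putIfAbsent returns the previously stored instance, or null if s was just added.
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    int size() {
        return pool.size();
    }

    public static void main(String[] args) {
        ConcurrentStringPool pool = new ConcurrentStringPool();
        String a = pool.intern(new String("New York"));
        String b = pool.intern(new String("New York"));
        System.out.println(a == b);      // true: both refer to the first instance
        System.out.println(pool.size()); // 1
    }
}
```

Because `putIfAbsent` is atomic, two threads interning equal strings at the same time still end up with the same instance.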
You may consider using WeakHashMap. Also see a related problem with Strings here - http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4513622.
Jayan
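One way the `WeakReference` / `WeakHashMap` suggestions in the comments above might be combined is sketched below. `WeakStringPool` is a hypothetical name, the class is not thread-safe, and it is only meant to show why the values need to be weak as well as the keys:

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

class WeakStringPool {
    // WeakHashMap holds its keys weakly. The values are wrapped in WeakReference
    // too, because a strongly held value would keep its own key reachable and
    // no entry would ever be collected.
    private final Map<String, WeakReference<String>> pool = new WeakHashMap<>();

    /** Returns the pooled instance equal to s, adding s to the pool if it is new. */
    String intern(String s) {
        WeakReference<String> ref = pool.get(s);
        String canonical = (ref != null) ? ref.get() : null;
        if (canonical != null) {
            return canonical;
        }
        pool.put(s, new WeakReference<>(s));
        return s;
    }

    public static void main(String[] args) {
        WeakStringPool pool = new WeakStringPool();
        String a = pool.intern(new String("Chicago"));
        String b = pool.intern(new String("Chicago"));
        System.out.println(a == b); // true while a is still strongly referenced
    }
}
```

Unlike the `HashMap` version, there is no need to call `clear()`: once nothing else references a pooled string, the garbage collector is free to drop its entry.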
+1  A: 

This information may be out of date, and I no longer have the code to back it up...

(what isn't out of date):

Strings read in via a Scanner, Reader, etc. are not interned. Only String literals are interned (of course that is up to the implementation; I don't think there is anything that says they cannot be interned).

(what may be out of date):

I wrote a program that I wanted to be fast and to use as little memory as possible. I tried with and without intern on each read of a String from a file. The intern way took significantly longer than not using intern, so much so that I decided not to do the intern. If performance matters, try timing your code with/without intern. You may also want to check the memory usage (a profiler will be good for that) with/without intern and see if the tradeoff makes a difference to you.
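A crude sketch of such a with/without comparison is below. It is not the original program: the class `InternExperiment`, the generated "value-N" data, and the sizes are all made up, and a single `System.nanoTime()` measurement like this is only a rough indication (a harness such as JMH would give more trustworthy numbers).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

class InternExperiment {
    /** Simulates reading `count` values of which only `distinct` have different contents. */
    static List<String> load(int count, int distinct, boolean intern) {
        List<String> values = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            // Run-time concatenation creates a new String object each time, like a parser would.
            String value = "value-" + (i % distinct);
            values.add(intern ? value.intern() : value);
        }
        return values;
    }

    /** Counts distinct String objects by identity, not by equals(). */
    static int countInstances(List<String> values) {
        Set<String> identities = Collections.newSetFromMap(new IdentityHashMap<String, Boolean>());
        identities.addAll(values);
        return identities.size();
    }

    public static void main(String[] args) {
        for (boolean intern : new boolean[] {false, true}) {
            long start = System.nanoTime();
            List<String> values = load(1_000_000, 100, intern);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("intern=" + intern
                    + " instances=" + countInstances(values)
                    + " time=" + elapsedMs + "ms");
        }
    }
}
```

The instance count shows the memory side of the tradeoff, and the elapsed time shows the CPU side. Real input files and a profiler will tell you more than synthetic data.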

TofuBeer
+2  A: 

One option Guava gives you here is to use an Interner rather than using String.intern(). Unlike String.intern(), which on JVMs before Java 7 stores interned strings in the permanent generation, a Guava Interner uses the ordinary heap. Additionally, you have the option of interning the Strings with weak references, so that when you're done using those Strings, the Interner won't prevent them from being garbage-collected. If you use the Interner in such a way that it's discarded when you're done with the strings, though, you can just use strong references with Interners.newStrongInterner() instead for possibly better performance.

    Interner<String> interner = Interners.newWeakInterner();
    String a = interner.intern(getStringFromCsv());
    String b = interner.intern(getStringFromCsv());
    // if a.equals(b), a == b will be true
ColinD
This definitely worked out well. Loading a test file with 100,000 records, memory usage dropped from 194MB to 128MB (used by application, checked after running GC), and average loading time dropped from 14s to 11s.
bemace
@bemace: Cool, glad to hear it.
ColinD
+1  A: 

When to intern a string? When you know you are going to have LOTS of strings with a LOW cardinality in a given place.

For example... batch processing code. You plan to process 100 million rows, and many of the POJOs that are created have a field (say a CITY field on a Person object) that will only be one of a few possible answers (New York, Chicago, etc.). Too many choices to do an enum, but you really don't need to create 45 million strings that say New York. You COULD use interning or some kind of home-rolled variation (a weak reference map is probably better than String.intern()) to reduce your memory footprint.
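A small sketch of the home-rolled variation for the CITY example, using a plain `HashMap` for brevity rather than a weak reference map. `CityCanonicalizer`, `Person`, and `newPerson` are hypothetical names, not taken from any real code base:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CityCanonicalizer {
    /** Hypothetical POJO with a low-cardinality field. */
    static final class Person {
        final String name;
        final String city;

        Person(String name, String city) {
            this.name = name;
            this.city = city;
        }
    }

    private final Map<String, String> cities = new HashMap<>();

    Person newPerson(String name, String rawCity) {
        // computeIfAbsent stores rawCity the first time it is seen and
        // returns the already stored instance on every later call.
        String city = cities.computeIfAbsent(rawCity, c -> c);
        return new Person(name, city);
    }

    public static void main(String[] args) {
        CityCanonicalizer canonicalizer = new CityCanonicalizer();
        List<Person> people = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            // new String(...) stands in for a value freshly parsed from an input row.
            String rawCity = new String(i % 2 == 0 ? "New York" : "Chicago");
            people.add(canonicalizer.newPerson("person-" + i, rawCity));
        }
        // Rows 0 and 2 both say "New York" and now share one String object.
        System.out.println(people.get(0).city == people.get(2).city); // true
    }
}
```

With this in place, the number of city strings kept alive is bounded by the number of distinct cities rather than by the number of rows.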

You can save memory space at the cost of possible CPU work... could be worth it in some places, but hard to say. GC is pretty fast, and your duplicate strings will get GCed as soon as they are done being used.

So if you ever get in a place where you are running into a memory wall, and have Strings with a low cardinality... you could consider interning.

bwawok