I've seen many primitive examples describing how String intern()'ing works, but I have yet to see a real-life use-case that would benefit from it.

The only situation that I can dream up is a web service that receives a considerable number of requests, each very similar in nature due to a rigid schema. By intern()'ing the request field names in this case, memory consumption can be significantly reduced.

Can anyone provide an example of using intern() in a production environment with great success? Maybe an example of it in a popular open source offering?

Edit: I am referring to manual interning, not the guaranteed interning of String literals, etc.

A: 

Not a complete answer but additional food for thought (found here):

Therefore, the primary benefit in this case is that using the == operator for internalized strings is a lot faster than using the equals() method [for non-internalized Strings]. So, use the intern() method if you're going to be comparing strings more than a time or three.
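
A minimal sketch of the comparison the quote is talking about. The class name `InternCompareDemo`, its helper method, and the sample value are made up for illustration. Note that `String.equals` also returns immediately when both references are identical, so the gap between `==` and `equals()` only matters for strings that are not the same object.

```java
public class InternCompareDemo {
    /** Returns true if both strings map to the same canonical (interned) instance. */
    static boolean sameAfterIntern(String a, String b) {
        return a.intern() == b.intern();
    }

    public static void main(String[] args) {
        // new String(...) forces two distinct objects with identical contents.
        String a = new String("request-id");
        String b = new String("request-id");

        System.out.println(a == b);                // false: different objects
        System.out.println(a.equals(b));           // true: same contents
        System.out.println(sameAfterIntern(a, b)); // true: one canonical instance
    }
}
```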

Andreas_D
This is true, but there are many exceptions to this generalization. If the odds of your Strings being the same length are minimal, and the number of Strings you could be intern()'ing is high, one could argue that since equals() does a length check first, you're unnecessarily exposing yourself to PermGen OutOfMemoryErrors.
Tom N
You're right, but performance-wise you have O(n) for equals and O(1) for `==`. I agree that the worst case only happens if both Strings are of equal length and differ only in the last char, which is usually a pretty rare case.
Andreas_D
The answer is incorrect. The first thing that String.equals does is check for equality of reference, before checking for semantic equality. So for two internalized strings == and .equals are, well, equal....
Visage
@Visage - Hey, don't downvote me, downvote the guy from jGuru ;) But you're right, the copied text is incorrect. I'll edit the quote to what I believe the author wanted to say.
Andreas_D
+1  A: 

We had a production system that processes literally millions of pieces of data at a time, many of which have string fields. We should have been interning strings, but there was a bug which meant we were not. By fixing the bug we avoided having to do a very costly (at least 6 figures, possibly 7) server upgrade.

Visage
Can you be more specific? E.g. what kind of data? Was it user driven or internal/cron driven? What was being done with the data? With this level of detail the example will be a bit clearer. Thanks!
Tom N
I'm limited in what I can disclose, but essentially it was financial transaction processing. We read in a whole load of data from a massive database and do large-scale data-warehousing type operations on it to discern aggregate aspects. Some textual fields in the data were not being interned on reading from the DB, leading to massive memory bloat and a big reduction in our processing capacity.
Visage
A: 

Never, ever, use intern on user-supplied data, as that can cause denial of service attacks (as intern()ed strings are never freed). You can do validation on the user-supplied strings, but then again you've done most of the work needed for intern().

Tassos Bassoukos
Your point on intern()'ed Strings not being freed is incorrect (depending upon the JVM). Most relevant JVMs use weak references to ensure gc.
Tom N
+8  A: 

Interning can be very beneficial if you have N strings that can take only K different values, where N far exceeds K. Now, instead of storing N strings in memory, you will only be storing up to K.

For example, you may have an ID type which consists of 5 digits. Thus, there can only be 10^5 different values. Suppose you're now parsing a large document that has many references/cross references to ID values. Let's say this document has 10^9 references in total (obviously some references are repeated in other parts of the document).

So N = 10^9 and K = 10^5 in this case. If you are not interning the strings, you will be storing 10^9 strings in memory, where lots of those strings are equal (by the pigeonhole principle). If you intern() the ID string you get when you're parsing the document, and you don't keep any reference to the uninterned strings you read from the document (so they can be garbage collected), then you will never need to store more than 10^5 strings in memory.
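
The N-versus-K argument can be sketched at a much smaller scale. This is only an illustration under stated assumptions: the class `InternIdDemo`, the `parse` helper, and the sizes n = 100,000 and k = 100 are invented, and "parsing" is simulated by formatting a counter into a 5-digit ID. Objects are counted by identity (not by `equals`) to show how many String instances the list actually keeps alive.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

public class InternIdDemo {
    /** Counts distinct String objects by identity, not by equals(). */
    static int distinctObjects(List<String> strings) {
        Set<String> identities =
                Collections.newSetFromMap(new IdentityHashMap<String, Boolean>());
        identities.addAll(strings);
        return identities.size();
    }

    /** Simulates reading n references drawn from k possible 5-digit IDs (hypothetical helper). */
    static List<String> parse(int n, int k, boolean intern) {
        List<String> refs = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            String id = String.format("%05d", i % k); // builds a fresh String on each call
            refs.add(intern ? id.intern() : id);      // keep only the canonical copy if interning
        }
        return refs;
    }

    public static void main(String[] args) {
        int n = 100_000, k = 100;
        // Without interning, every reference keeps its own String object alive.
        System.out.println("not interned: " + distinctObjects(parse(n, k, false)));
        // With interning, the list holds at most k distinct objects.
        System.out.println("interned:     " + distinctObjects(parse(n, k, true)));
    }
}
```

In the interned case the temporary strings created by `String.format` become garbage as soon as `intern()` returns the canonical instance, which is the "don't keep any reference to the uninterned strings" condition above.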

polygenelubricants
I believe this to be a near perfect assessment, thanks for abstracting it out polygenelubricants. My difficulty in coming up with a tangible example lies with the fact that even in the above case, more often than not you can stream the input data and do work on it in chunks vs. all at once. Streaming vs. intern()'ing (if applicable) would almost always be preferable assuming negligible network latency/impact in the case of a remote source. Thing is, I've never seen a use-case that meets the threshold of Strings necessary to consider intern(), but cannot be streamed and divide and conquered.
Tom N
@Tom: see also the related http://stackoverflow.com/questions/1356341/will-interning-strings-help-performance-in-a-parser - this is also parser related, and motivated by the same pigeonhole principle. An XML document may have one million `<item>` elements, but probably only very few element types. You can intern the element names so that `"item"` only appears once in memory (not counting the temporary garbage instances, which are immediately let go in favor of their `intern()` representative).
polygenelubricants
+1  A: 

Examples where interning will be beneficial involve a large number of strings where:

  • the strings are likely to survive multiple GC cycles, and
  • there are likely to be multiple copies of a large percentage of the Strings.

Typical examples involve splitting / parsing a text into symbols (words, identifiers, URIs) and then attaching those symbols to long-lived data structures. XML processing, programming language compilation and RDF / OWL triple stores spring to mind as applications where interning is likely to be beneficial.

But interning is not without its problems, especially if it turns out that the assumptions above are not correct:

  • the pool data structure used to hold the interned strings takes extra space,
  • interning takes time, and
  • interning doesn't prevent the creation of the duplicate string in the first place.

Finally, interning potentially increases GC overheads by increasing the number of objects that need to be traced and copied, and by increasing the number of weak references that need to be dealt with. This increase in overheads has to be balanced against the decrease in GC overheads that results from effective interning.

Stephen C