I'm reading from a binary file and want to convert the bytes to US-ASCII strings. Is there any way to do this without calling new on String, so that I avoid creating multiple semantically equal String objects in the normal object pool? I suspect it isn't possible, since I can't introduce these String objects as double-quoted literals here. Is this correct?

private String nextString(DataInputStream dis, int size)
throws IOException
{
  byte[] bytesHolder = new byte[size];
  // readFully blocks until all `size` bytes have been read; a plain read() may return fewer
  dis.readFully(bytesHolder);
  return new String(bytesHolder, Charset.forName("US-ASCII")).trim();
}
+2  A: 

You can call the intern() method on the string to ensure there is only one instance of each distinct value in the whole JVM.

String s = new String(bytes, "US-ASCII").intern();

You won't avoid creating the initial string again, but you will save on the storage.
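To make that concrete, here is a minimal sketch. The class name InternDemo and the helper decode are made up for illustration. Decoding the same bytes twice gives two distinct but equal String objects, and intern() maps both to a single canonical instance:

```java
import java.nio.charset.StandardCharsets;

class InternDemo {
    // Hypothetical helper: decodes the bytes as US-ASCII.
    // Every call allocates a fresh String object.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] raw = {'a', 'b', 'c'};
        String first = decode(raw);
        String second = decode(raw);

        System.out.println(first == second);                   // false: two separate objects
        System.out.println(first.equals(second));              // true: same characters
        System.out.println(first.intern() == second.intern()); // true: one canonical instance
    }
}
```

The two temporary strings still get allocated; only the copies you keep referencing afterwards are shared.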

That being said, interned strings have limited storage space, so use intern() with caution. A better option may be to keep a HashMap with the string as both key and value: look the string up, return the existing instance if there is one, and insert it if there isn't. That way you won't have such memory limitations.

Yishai
Ah, I didn't know about intern(). I'll go check out interning and decide whether it's worth it or not.
Wesho
+2  A: 

You'd have to have a cache mapping byte arrays to strings, then search through the cache for any equal values before creating a new string.
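A rough sketch of such a byte-keyed cache follows; the class name ByteKeyedStringCache is invented for illustration. One catch is that a raw byte[] makes a poor HashMap key, because arrays use identity-based equals() and hashCode(). This sketch assumes it is acceptable to wrap the array in a ByteBuffer, which compares by content:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class ByteKeyedStringCache {
    // ByteBuffer implements content-based equals() and hashCode(),
    // unlike byte[], so it can serve as a map key.
    private final Map<ByteBuffer, String> cache = new HashMap<>();

    String get(byte[] bytes) {
        // wrap() does not copy the array; it only creates a small view object.
        String cached = cache.get(ByteBuffer.wrap(bytes));
        if (cached == null) {
            cached = new String(bytes, StandardCharsets.US_ASCII).trim();
            // Store a private copy so that later writes to the caller's
            // array cannot change the key behind the map's back.
            cache.put(ByteBuffer.wrap(Arrays.copyOf(bytes, bytes.length)), cached);
        }
        return cached;
    }
}
```

On a cache hit no String is created, but each lookup still allocates a short-lived ByteBuffer wrapper. Byte sequences that differ only in the whitespace removed by trim() are separate keys, so they can still produce equal but distinct strings. The sketch is not thread-safe.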

You can intern existing strings with intern() as Yishai posted - that won't stop you from creating more strings, but it'll make all but the first one (for any char sequence) very short lived. On the other hand, it'll make all the distinct strings live for a very long time indeed.

You can have "pseudo-interning" by using a Map<String, String>:

String tmp = new String(bytesHolder, Charset.forName("US-ASCII")).trim();
String cached = cache.get(tmp);
if (cached == null)
{
    cached = tmp;
    cache.put(tmp, tmp);
}
return cached;

You could even put a bit more effort in and end up with an LRU cache so that it'll keep the N most recently fetched strings, discarding others when it needs to.
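One way to sketch that LRU variant is with a LinkedHashMap in access order, overriding removeEldestEntry. The class name LruStringCache and its methods are made up for illustration, and the sketch is not thread-safe:

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LruStringCache {
    private final Map<String, String> cache;

    LruStringCache(final int maxEntries) {
        // accessOrder = true orders entries from least to most recently used.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                // Called after each put; returning true evicts the
                // least recently used entry.
                return size() > maxEntries;
            }
        };
    }

    // Returns the cached instance equal to s, caching s itself if there is none.
    String canonicalize(String s) {
        String cached = cache.get(s);
        if (cached == null) {
            cache.put(s, s);
            cached = s;
        }
        return cached;
    }

    int entryCount() {
        return cache.size();
    }
}
```

In the method from the question, the last line would then become something like return lru.canonicalize(tmp), where lru is a field holding an LruStringCache of whatever capacity suits the data.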

None of that reduces the number of strings created in the first place, as I say - but is that likely to be a problem in your situation? GCs have been tuned to make it very cheap to create short-lived objects.

Jon Skeet
It's not a problem to have 'new' called if they will be garbage collected, I was more concerned with the number of Strings in the pool. But you are right when saying they may be too 'long lived' using intern(). Something to think about.
Wesho
+1  A: 

You shouldn’t be concerned about it unless you have profiled your application and determined that String creation is the exact source of your problem.

If you find out that the String creation is the source of your problem I would recommend what Jon Skeet proposed, i.e. a mapping from byte[] to String. That has about the same effect as interning your Strings while not hogging up valuable memory until you restart the VM.

Bombe