views: 3537
answers: 13
After reading this old article measuring the memory consumption of several object types, I was amazed to see how much memory Strings use in Java:

length: 0, {class java.lang.String} size = 40 bytes
length: 7, {class java.lang.String} size = 56 bytes

While the article has some tips to minimize this, I did not find them entirely satisfying. It seems wasteful to use a char[] for storing the data. The obvious improvement for most western languages would be to use a byte[] and an encoding like UTF-8 instead, since the most frequent characters would then need only a single byte instead of two.

Of course one could use String.getBytes("UTF-8") and new String(bytes, "UTF-8"). Even the overhead of the String instance itself would be gone. But then you lose very handy methods like equals(), hashCode(), length(), ...
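To illustrate, here is a rough sketch of the kind of byte[]-backed wrapper I mean (the class name and fields are made up for illustration, not an existing API):

    import java.io.UnsupportedEncodingException;

    // Stores the text as UTF-8 bytes: one byte for the most frequent (ASCII) characters.
    public final class Utf8String {
        private final byte[] bytes;

        public Utf8String(String s) throws UnsupportedEncodingException {
            this.bytes = s.getBytes("UTF-8");
        }

        // Decoding on demand restores a regular String, but every call pays the
        // conversion cost and temporarily doubles the memory for this string.
        @Override
        public String toString() {
            try {
                return new String(bytes, "UTF-8");
            } catch (UnsupportedEncodingException e) {
                throw new AssertionError(e); // UTF-8 is always supported
            }
        }

        // equals(), hashCode(), length(), ... would all have to be reimplemented
        // by hand (or by decoding first), which is exactly the drawback above.
    }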

Sun has a patent on byte[] representation of Strings, as far as I can tell.

Frameworks for efficient representation of string objects in Java programming environments
... The techniques can be implemented to create Java string objects as arrays of one-byte characters when it is appropriate ...

But I failed to find an API for that patent.

Why do I care?
In most cases I don't. But I have worked on applications with huge caches containing lots of Strings, which would have benefited from using memory more efficiently.

Does anybody know of such an API? Or is there another way to keep the memory footprint of Strings small, even at the cost of CPU performance or an uglier API?

Please don't repeat the suggestions from the above article:

  • own variant of String.intern() (possibly with SoftReferences)
  • storing a single char[] and exploiting the current String.substring() implementation to avoid data copying (nasty ;)

Update

I ran the code from the article on Sun's current JVM (1.6.0_10). It yielded the same results as in 2002.

+1  A: 

Out of curiosity, are the few bytes saved really worth it?

Normally, I suggest ditching strings for performance reasons, in favor of StringBuffer (Remember, Strings are immutable).

Are you seriously exhausting your heap from string references?

FlySwat
Few bytes? For many environments (ASCII-only data), Java's storage requirements are slightly more than double the required amount. For large volumes of data, this is indeed a large block of wasted memory.
jsight
As I wrote, in most cases no. But yes, I have written more than one app where the biggest part of the heap was String instances and the corresponding char[]. The few bytes add up to several hundred MB.
the.duckman
I wouldn't suggest using StringBuffer, but if you were going to go that route, you should use StringBuilder instead: it is unsynchronized, unlike StringBuffer, and is thus much faster in the vast majority of use cases.
Alex Miller
+1  A: 

Just compress them all with gzip. :) Just kidding... but I have seen stranger things, and it would give you much smaller data at significant CPU expense.

The only other String implementations that I'm aware of are the ones in the Javolution classes. I don't think that they are more memory efficient, though:

http://www.javolution.com/api/javolution/text/Text.html
http://www.javolution.com/api/javolution/text/TextBuilder.html

jsight
Zip only pays off for Strings longer than a few hundred characters. I did Huffman coding with static lookups once - that worked. But that means we store the data in a byte[] again. Unfortunately the Javolution classes are not memory efficient, as a Google Code search showed - you were right.
the.duckman
Yes, zip won't work for that reason (headers too big)... but I think gzip crosses over at smaller values, albeit probably still in the 100+ char range. It is kind of surprising that no one has developed one with memory efficiency as the primary goal.
jsight
+6  A: 

I think you should be very cautious about basing any ideas or assumptions on a javaworld.com article from 2002. There have been many, many changes to the compiler and JVM in the six years since then. At the very least, test your hypothesis and solution against a modern JVM first to make sure that the solution is even worth the effort.

matt b
True. I just ran the code from the article on Sun's newest 1.6.0_10 JVM. Same results as 2002.
the.duckman
A: 

I believe that Strings have been less memory-intensive for some time now, because the Java engineers have implemented the flyweight design pattern to share as much as possible. In fact, I believe Strings that have the same value point to the very same object in memory.

nkr1pt
No, they are not. I ran the code from the article on Sun's newest 1.6.0_10 JVM. Same results as 2002.
the.duckman
Yes, nkr1pt, you are right. They often point to the same object in memory, and "abc" and "abcdef" can even point to the same exact array since "length" is stored independently.
Bill K
They can be interned so that all equal strings are shared, but my assumption is that he wasn't wanting to do that (possibly long strings with not much duplication?). Large strings are not automatically shared.
jsight
Sorry, my answer was not precise enough. I meant: no, they are not "less memory intensive for some time now". And yes, you are right in a special case: the compilers are clever enough nowadays to merge equal String literals within a single class into the same instance. That's why "a"=="a" yields true.
the.duckman
A: 

You said not to repeat the article's suggestion of rolling your own interning scheme, but what's wrong with String.intern itself? The article contains the following throwaway remark:

Numerous reasons exist to avoid the String.intern() method. One is that few modern JVMs can intern large amounts of data.

But even if the memory usage figures from 2002 still hold six years later, I'd be surprised if no progress has been made on how much data JVMs can intern.

This isn't purely a rhetorical question - I'm interested to know if there are good reasons to avoid it. Is it implemented inefficiently for highly-multithreaded use? Does it fill up some special JVM-specific area of the heap? Do you really have hundreds of megabytes of unique strings (so interning would be useless anyway)?

Sam Stokes
Some time ago I read that interned Strings are stored in the PermGen and are never freed again. I don't know how this is handled today. This page http://wiki.eclipse.org/index.php/Performance_Bloopers lists using String.intern() as a blooper in the implementation of Eclipse 3.0.
the.duckman
Good question regarding PermGen... I don't know if the VMs do that or not. I think most of the time the problem with intern() is just that the strings you are interning end up not being duplicated as much as you think, so the intern() calls can end up destroying your performance gains. Or perhaps it depends on use.
jsight
The problem with indiscriminate use of intern() is that interned strings cannot be garbage collected (i.e. PermGen). In other words, a memory leak.
Kevin Day
+1  A: 

There is the overhead of creating an object (at least a dispatch table), the overhead of using 2 bytes per letter, and the overhead of a few extra fields that exist to actually improve speed and memory usage in many cases.

If you are going to use OO programming, this is the cost of having clear, usable, maintainable code.

For an answer besides the obvious (which is that if memory usage is that important, you should probably be using C), you could implement your own Strings with an internal representation in BCD byte-arrays.

That actually sounds fun, I might do it just for kicks :)

A Java char[] takes 2 bytes per character. A BCD-encoded digit takes 6 bits per letter IIRC, making your strings significantly smaller. There would be a little conversion cost in time, but not too bad really. The really big problem is that you'd have to convert to a String to do anything with it.
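For the curious, a rough sketch of that kind of packed representation, assuming a restricted 64-symbol alphabet so that each character fits into 6 bits (the alphabet table below is made up purely for illustration):

    // Packs characters of a small, fixed alphabet into 6 bits each.
    public final class PackedString {
        // 63 symbols; each index fits into 6 bits. Purely illustrative.
        private static final String ALPHABET =
                " abcdefghijklmnopqrstuvwxyz0123456789.,:;-_!?'\"()[]{}/\\+*=<>@#&";

        private final byte[] packed; // 6 bits per character, bit-packed
        private final int length;

        public PackedString(String s) {
            this.length = s.length();
            this.packed = new byte[(length * 6 + 7) / 8];
            for (int i = 0; i < length; i++) {
                int code = ALPHABET.indexOf(s.charAt(i));
                if (code < 0) {
                    throw new IllegalArgumentException("unsupported char: " + s.charAt(i));
                }
                setCode(i, code);
            }
        }

        private void setCode(int charIndex, int code) {
            for (int j = 0, bit = charIndex * 6; j < 6; j++, bit++) {
                if ((code & (1 << j)) != 0) {
                    packed[bit >> 3] |= (1 << (bit & 7));
                }
            }
        }

        private int getCode(int charIndex) {
            int code = 0;
            for (int j = 0, bit = charIndex * 6; j < 6; j++, bit++) {
                if ((packed[bit >> 3] & (1 << (bit & 7))) != 0) {
                    code |= (1 << j);
                }
            }
            return code;
        }

        // As noted above, you have to convert back before doing anything useful.
        @Override
        public String toString() {
            StringBuilder sb = new StringBuilder(length);
            for (int i = 0; i < length; i++) {
                sb.append(ALPHABET.charAt(getCode(i)));
            }
            return sb.toString();
        }
    }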

You still have the overhead of an object instance to worry about... but that would be better addressed by revamping your design than trying to eliminate instances.

Finally a note. I'm completely against deploying anything like this unless you have 3 things:

  • An implementation done the most readable way
  • Test results and requirements showing how that implementation doesn't meet requirements
  • Test results on how the "improved" implementation DOES meet requirements.

Without all three of those, I'd kick any optimized solution a developer presented to me.

Bill K
+1  A: 

Java chose UTF-16 as a compromise between speed and storage size. Processing UTF-8 data is much more of a pain than processing UTF-16 data (e.g. when trying to find the position of character X in the byte array, how are you going to do so quickly if every character can take one, two, three or even up to six bytes? Ever thought about that? Going over the string byte by byte is not really fast, you see?). Of course UTF-32 would be easiest to process, but it wastes twice the storage space. Things have changed since the early Unicode days. Now certain characters need 4 bytes, even when UTF-16 is used. Handling these correctly makes UTF-16 almost as bad as UTF-8.

Anyway, rest assured that if you implement a String class with internal storage that uses UTF-8, you might win some memory, but you will lose processing speed for many string methods. Your argument is also far too limited a point of view: it will not hold true for someone in Japan, since Japanese characters are not smaller in UTF-8 than in UTF-16 (they actually take 3 bytes in UTF-8, while they are only two bytes in UTF-16). I don't understand why programmers in such a global world as today's, with the omnipresent Internet, still talk about "western languages", as if that is all that counts, as if only the western world has computers and the rest lives in caves. Sooner or later any application gets bitten by the fact that it fails to process non-western characters effectively.

Mecki
+4  A: 

An internal UTF-8 encoding has its advantages (such as the smaller memory footprint that you pointed out), but it has disadvantages too.

For example, determining the character length (rather than the byte length) of a UTF-8 encoded string is an O(n) operation. In a Java String, the cost of determining the character length is O(1), while generating the UTF-8 representation is O(n).
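A minimal illustration of that cost - counting the characters (code points) in a UTF-8 byte[] requires walking the whole array and skipping continuation bytes:

    // O(n): every byte must be examined; continuation bytes (10xxxxxx) are skipped.
    static int utf8CharacterLength(byte[] utf8) {
        int count = 0;
        for (byte b : utf8) {
            if ((b & 0xC0) != 0x80) { // not a continuation byte => start of a character
                count++;
            }
        }
        return count;
    }

By contrast, String.length() simply returns a stored value, which is O(1).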

It's all about priorities.

Data-structure design can often be seen as a tradeoff between speed and space. In this case, I think the designers of the Java string API made a choice based on these criteria:

  • The String class must support all possible unicode characters.

  • Although Unicode characters can take 1, 2, or 4 bytes depending on the encoding, the 4-byte (supplementary) characters are pretty rare in practice, so it's okay to represent them as surrogate pairs. That's why Java uses a 2-byte char primitive.

  • When people call length(), indexOf(), and charAt() methods, they're interested in the character position, not the byte position. In order to create fast implementations of these methods, it's necessary to avoid the internal UTF-8 encoding.

  • Languages like C++ make the programmer's life more complicated by defining three different character types and forcing the programmer to choose between them. Most programmers start off using simple ASCII strings, but when they eventually need to support international characters, the process of modifying the code to use multibyte characters is extremely painful. I think the Java designers made an excellent compromise choice by saying that all strings consist of 2-byte characters.

benjismith
I do not criticize the default implementation of String. I totally agree with you on all points. But there are use cases where you are ready to sacrifice CPU performance for memory efficiency. The fact that Sun has a patent on the issue supports my argument, I feel.
the.duckman
Well, I suppose you could just pass your strings around as byte arrays and then use a CharsetDecoder to convert them to strings on demand. I agree that it'd be nice if the String class provided a constructor that would do it for you, but I don't think it'd be worth having a whole different class.
benjismith
+9  A: 

The article points out two things:

  1. Character arrays increase in chunks of 8 bytes.
  2. There is a large difference in size between char[] and String objects.

The overhead is due to including a char[] object reference, and three ints: an offset, a length, and space for storing the String's hashcode, plus the standard overhead of simply being an object.

Slightly different from String.intern(), or from the shared character array used by String.substring(), is using a single char[] for all Strings: this means you do not need to store the object reference in your wrapper, String-like object. You would still need the offset, and you introduce a (large) limit on how many characters you can have in total.

You would no longer need the length if you use a special end of string marker. That saves four bytes for the length, but costs you two bytes for the marker, plus the additional time, complexity, and buffer overrun risks.

The space-time trade-off of not storing the hash may help you if you do not need it often.

For an application I worked on, where I needed super-fast and memory-efficient handling of a large number of strings, I was able to leave the data in its encoded form and work with byte arrays. My output encoding was the same as my input encoding, and I didn't need to decode bytes to characters nor encode them back to bytes for output.

In addition, I could leave the input data in the byte array it was originally read into - a memory mapped file.

My objects consisted of an int offset (the limit suited my situation), an int length, and an int hashcode.

java.lang.String was the familiar hammer for what I wanted to do, but not the best tool for the job.
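A rough sketch of that kind of object (names made up; the single shared buffer could be the memory-mapped input file):

    import java.nio.ByteBuffer;

    // A lightweight "string" that is only a view into one big shared buffer,
    // storing nothing but an int offset, an int length, and an int hashcode.
    public final class ByteSlice {
        private static ByteBuffer data; // the single shared buffer

        /** Must be called once, e.g. with a buffer mapped from the input file. */
        public static void attach(ByteBuffer buffer) {
            data = buffer;
        }

        private final int offset;
        private final int length;
        private final int hash;

        public ByteSlice(int offset, int length) {
            this.offset = offset;
            this.length = length;
            int h = 0;
            for (int i = 0; i < length; i++) {
                h = 31 * h + data.get(offset + i);
            }
            this.hash = h;
        }

        public int length() {
            return length;
        }

        @Override
        public int hashCode() {
            return hash; // precomputed, no need to touch the buffer
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof ByteSlice)) {
                return false;
            }
            ByteSlice other = (ByteSlice) o;
            if (length != other.length || hash != other.hash) {
                return false;
            }
            for (int i = 0; i < length; i++) {
                if (data.get(offset + i) != data.get(other.offset + i)) {
                    return false;
                }
            }
            return true;
        }
    }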

Stephen Denne
+7  A: 

At Terracotta, we have some cases where we compress big Strings as they are sent around the network and actually leave them compressed until decompression is necessary. We do this by converting the char[] to byte[], compressing the byte[], then encoding that byte[] back into the original char[]. For certain operations like hash and length, we can answer those questions without decoding the compressed string. For data like big XML strings, you can get substantial compression this way.

Moving the compressed data around the network is a definite win. Keeping it compressed is dependent on the use case. Of course, we have some knobs to turn this off and change the length at which compression turns on, etc.

This is all done with bytecode instrumentation on java.lang.String, which we've found is very delicate due to how early String is used during startup, but it is stable if you follow some guidelines.
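A standalone sketch of the same idea (not the actual Terracotta code, which instruments java.lang.String itself): keep the compressed bytes together with the precomputed length and hashcode, so those two questions never force a decompression.

    import java.io.ByteArrayOutputStream;
    import java.util.zip.Deflater;
    import java.util.zip.Inflater;

    public final class CompressedString {
        private final byte[] compressed;
        private final int length; // answered without decompressing
        private final int hash;   // answered without decompressing

        public CompressedString(String s) {
            this.length = s.length();
            this.hash = s.hashCode();
            try {
                byte[] input = s.getBytes("UTF-8");
                Deflater deflater = new Deflater();
                deflater.setInput(input);
                deflater.finish();
                ByteArrayOutputStream out = new ByteArrayOutputStream(input.length);
                byte[] buf = new byte[1024];
                while (!deflater.finished()) {
                    out.write(buf, 0, deflater.deflate(buf));
                }
                this.compressed = out.toByteArray();
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }

        public int length() { return length; }
        public int hash()   { return hash; }

        public String decompress() {
            try {
                Inflater inflater = new Inflater();
                inflater.setInput(compressed);
                ByteArrayOutputStream out = new ByteArrayOutputStream(length);
                byte[] buf = new byte[1024];
                while (!inflater.finished()) {
                    out.write(buf, 0, inflater.inflate(buf));
                }
                return new String(out.toByteArray(), "UTF-8");
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
    }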

Alex Miller
A: 

Remember that there are many types of compression. Huffman encoding is a good general-purpose approach, but it is relatively CPU intensive. For a B+Tree implementation I worked on a few years back, we knew that the keys would likely have common leading characters, so we implemented a leading-character compression algorithm for each page in the B+Tree. The code was easy, very, very fast, and resulted in memory usage 1/3 of what we started with. In our case, the real reason for doing this was to save space on disk and reduce time spent on disk -> RAM transfers (and that 1/3 savings made a huge difference in effective disk performance).

The reason I bring this up is that a custom String implementation wouldn't have helped very much here. We were only able to achieve the gains we did because we worked at the layer of the container that the strings live in.

Trying to optimize a few bytes here and there inside the String object may not be worth it in comparison.
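For illustration, a minimal sketch of leading-character (front) compression over a sorted page of keys - each entry records how many leading characters it shares with the previous key, plus the differing suffix (a real implementation would write counts and bytes into the page rather than build Strings):

    // "apple", "application", "apply" -> "0|apple", "4|ication", "4|y"
    static String[] compressPage(String[] sortedKeys) {
        String[] compressed = new String[sortedKeys.length];
        String previous = "";
        for (int i = 0; i < sortedKeys.length; i++) {
            String key = sortedKeys[i];
            int shared = 0;
            int max = Math.min(previous.length(), key.length());
            while (shared < max && previous.charAt(shared) == key.charAt(shared)) {
                shared++;
            }
            compressed[i] = shared + "|" + key.substring(shared);
            previous = key;
        }
        return compressed;
    }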

Kevin Day
A: 

I'm currently implementing a compression method as follows (I'm working on an app that needs to store a very large number of documents in memory so we can do document-to-document computation):

  • Split the string into 4-character "words" (if you need all of Unicode) and pack those characters into a long using masking/bit shifting (see the sketch after this list). If you don't need the full Unicode set and can get by with single-byte (e.g. ASCII) characters, you can fit 8 characters into each long. Append (char) 0 to the end of the string until the length divides evenly by 4 (or 8).
  • Override a hash set implementation (like Trove's TLongHashSet) and add each "word" to that set, compiling an array of the internal indexes of where the long ends up in the set (make sure you also update your index when the set gets rehashed)
  • Use a two-dimensional int array to store these indexes (so the first dimension is each compressed string, and the second dimension is each "word" index in the hash set), and return the single int index into that array back to the caller (you have to own the word arrays so you can globally update the index on a rehash as mentioned above)
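A minimal sketch of the packing step from the first bullet, assuming the full-Unicode variant (four 16-bit chars per long); the method names are made up:

    // Packs s.charAt(start) .. s.charAt(start + 3) into one long,
    // padding with (char) 0 past the end of the string.
    static long packWord(String s, int start) {
        long word = 0;
        for (int i = 0; i < 4; i++) {
            char c = (start + i < s.length()) ? s.charAt(start + i) : (char) 0;
            word |= ((long) c) << (i * 16);
        }
        return word;
    }

    // Reverses the packing, dropping the (char) 0 padding.
    static void unpackWord(long word, StringBuilder out) {
        for (int i = 0; i < 4; i++) {
            char c = (char) ((word >>> (i * 16)) & 0xFFFF);
            if (c != 0) {
                out.append(c);
            }
        }
    }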

Advantages:

  • Constant time compression/decompression
  • A length-n string is represented as an int array of length n/4, plus the overhead of the shared set of long "words", which grows more and more slowly as fewer new unique "words" are encountered
  • The user is handed back a single int string "ID" which is convenient and small to store in their objects

Disadvantages:

  • Somewhat hacky since it involves bit shifting, messing with the internals of the hash set, etc. (Bill K would not approve)
  • Works well when you don't expect a lot of duplicate strings. It's very expensive to check to see if a string already exists in the library.
J. Dimeo
A: 

Today (2010), each GB you add to a server costs about £80 or $120. Before you go re-engineering the String, you should ask yourself if it is really worth it.

If you are going to save a GB of memory, perhaps. Ten GB, definitely. If you want to save tens of MB, you are likely to spend more time than it's worth.

How you compact the Strings really depends on your usage pattern. Are there lots of repeated strings? (use an object pool) Are there lots of long strings? (use compression/encoding)

Another reason you might want smaller strings is to reduce cache usage. Even the largest CPUs have about 8 MB - 12 MB of cache. This can be a more precious resource and is not easily increased. In this case I suggest you look at alternatives to strings, but you must keep in mind how much difference it will make in £ or $ against the time it takes.
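For the repeated-strings case, a minimal sketch of such an object pool (a simple canonicalizing map; in a real application you would bound it or hold the values via soft references; the class name is made up):

    import java.util.HashMap;
    import java.util.Map;

    public final class StringPool {
        private final Map<String, String> pool = new HashMap<String, String>();

        // Returns one shared instance per distinct value, so duplicates
        // can be garbage collected.
        public synchronized String canonical(String s) {
            String existing = pool.get(s);
            if (existing == null) {
                pool.put(s, s);
                return s;
            }
            return existing;
        }
    }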

Peter Lawrey