views:

353

answers:

4

I know that Java collections are very memory-hungry, and did a test myself, proving that 4GB is barely enough to store few millions of Integers into a HashSet.

But what if I has "enough" memory? What would happen to Collection.size()?

EDIT: Solved: Collection.size() returns Integer.MAX when the integer range is exceeded.
New question: how to determine the "real" count of elements of a collection then?

NOTE 1: Sorry, this is probably a let-me-google-it-for-you question, but I really didn't find anything ;)

NOTE 2: As far as I understand it, each integer entry of a set is: reference + cached_hashcode + boxed_integer_object + real_int_value, right?

NOTE 3: Funny, even with JDK7 and "compressed pointers", when the JVM uses 2GB of real memory, it shows only 1.5GB allocated memory in VisualVM.

For those who care:

Test sources:

import java.util.*;
import java.lang.management.*;

public final class _BoxedValuesInSetMemoryConsumption {
  private final static int MILLION = 1000 * 1000;

  public static void main(String... args) {
    Set<Integer> set = new HashSet<Integer>();

    for (int i = 1;; ++i) {
      if ((i % MILLION) == 0) {
        int milsOfEntries = (i / MILLION);
        long mbytes = ManagementFactory.getMemoryMXBean().
            getHeapMemoryUsage().getUsed() / MILLION;
        int ratio = (int) mbytes / milsOfEntries;
        System.out.println(milsOfEntries + " mil, " + mbytes + " MB used, "
            + " ratio of bytes per entry: " + ratio);
      }

      set.add(i);
    }
  }
}

Execution parameters:

Tested with x64 version of JDK7 build 105 under OpenSuse 11.3 x64.

-XX:+UseCompressedOops -Xmx2048m

Output result:

1 mil, 56 MB used,  ratio of bytes per entry: 56
2 mil, 113 MB used,  ratio of bytes per entry: 56
3 mil, 161 MB used,  ratio of bytes per entry: 53
4 mil, 225 MB used,  ratio of bytes per entry: 56
5 mil, 274 MB used,  ratio of bytes per entry: 54
6 mil, 322 MB used,  ratio of bytes per entry: 53
7 mil, 403 MB used,  ratio of bytes per entry: 57
8 mil, 452 MB used,  ratio of bytes per entry: 56
9 mil, 499 MB used,  ratio of bytes per entry: 55
10 mil, 548 MB used,  ratio of bytes per entry: 54
11 mil, 596 MB used,  ratio of bytes per entry: 54
12 mil, 644 MB used,  ratio of bytes per entry: 53
13 mil, 827 MB used,  ratio of bytes per entry: 63
14 mil, 874 MB used,  ratio of bytes per entry: 62
15 mil, 855 MB used,  ratio of bytes per entry: 57
16 mil, 902 MB used,  ratio of bytes per entry: 56
17 mil, 951 MB used,  ratio of bytes per entry: 55
18 mil, 999 MB used,  ratio of bytes per entry: 55
19 mil, 1047 MB used,  ratio of bytes per entry: 55
20 mil, 1096 MB used,  ratio of bytes per entry: 54
21 mil, 1143 MB used,  ratio of bytes per entry: 54
22 mil, 1191 MB used,  ratio of bytes per entry: 54
23 mil, 1239 MB used,  ratio of bytes per entry: 53
24 mil, 1288 MB used,  ratio of bytes per entry: 53
25 mil, 1337 MB used,  ratio of bytes per entry: 53
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

At the end, about 2 GiB real memory were used, instead of displayed 1.3 GiB, so the consumption for each entry is even larger than 53 bytes.

+12  A: 

I know that Java collections are very memory-hungry, and did a test myself, proving that 4GB is barely enough to store few millions of Integers into a HashSet.

Java Heap != System Memory. Java's default heap size is only 128MB. Note this is also different from the memory the JVM uses.

Regarding your question: from the docs,

public int size()

Returns the number of elements in this collection. If this collection contains more than Integer.MAX_VALUE elements, returns Integer.MAX_VALUE.

quantumSoup
Good, I should have looked into the javadoc first, typical mistake. Thank you!
java.is.for.desktop
Now, I changed the question (how to determine real amount of elements), so I can't accept that as correct answer. Sorry, I should have started a new question, but now, with all these comments, I think it's too late now.
java.is.for.desktop
Another way to find a method's behaviour is to download the source (you may have it already, look for src.zip in your sdk directory) and look at the code directly. If you're using eclipse you can even tell eclipse where to find this file so that you can Ctrl-click to the source of all library methods.
Michael Clerx
+3  A: 

From the source code:

 /**
 * Returns the number of elements in this collection.  If this collection
 * contains more than <tt>Integer.MAX_VALUE</tt> elements, returns
 * <tt>Integer.MAX_VALUE</tt>.
 * 
 * @return the number of elements in this collection
 */
int size();
Ido
+4  A: 

Your question seems to have a quite different content than the title.

You already answered the question in the title (Integer.MAX_VALUE is returned). And no: there's no way you can find out the "true" size with the normal APIs safe for iterating over the collection and counting (using a long of course).

If you want to store a Set of int values and you know that the range and amount of values can become very big, then a BitSet might actually be a better implementation:

import java.util.*;
import java.lang.management.*;

public final class IntegersInBitSetMemoryConsumption {
  private final static int MILLION = 1000 * 1000;

  public static void main(String... args) {
    BitSet set = new BitSet(Integer.MAX_VALUE);

    for (int i = 1;; ++i) {
      if ((i % MILLION) == 0) {
        int milsOfEntries = (i / MILLION);
        long mbytes = ManagementFactory.getMemoryMXBean().
            getHeapMemoryUsage().getUsed() / MILLION;
        double ratio = mbytes / milsOfEntries;
        System.out.println(milsOfEntries + " mil, " + mbytes + " MiB used, "
            + " ratio of bytes per entry: " + ratio);
      }

      set.set(i);
    }
  }
}

This will produce a constant-size data structure that can hold all values inside the range without changing size and occupying a relatively small amount of memory (1 bit per possible value plus some overhead).

This method has two drawbacks, however:

  • it doesn't support negative int values
  • it doesn't provide the Set API

Both can easily be worked around by writing a wrapper that uses two BitSet objects (possibly lazily allocated) to hold the positive and negative value range respectively and implements adapter methods for the Set interface.

Joachim Sauer
Thank you for the hint! But, actually, I was using a `Set` of `Integers` just to demonstrate the "problem" of overhead and memory growing faster than expected.
java.is.for.desktop
@java: in that case the solution is simple: don't use wrappers when primitive values are sufficient.
Joachim Sauer
Ok, if there is no other method than counting, this is too an answer ;) -- I wonder if there is an RFE in java bugs for something like `Collections.sizeLong(Collection<?> col)`...
java.is.for.desktop
@java: in the few cases where your collections are really bigger than that, you probably don't want to loop through it anyway and knowing the exact size is not actually that important. In fact, I think the best solution would be an additional `HugeCollection` interface that collections that are to be expected of that size could implement.
Joachim Sauer
A: 

The generic answer for any real processor architecture is that you just cannot. The reason is simple: there can't be more allocated objects (of at least 1 word size) than addressable memory.

Of course, given the virtual nature of the JVM, there's a scenario where that can happen. int will always be 32bit signed, and you can implement and run the JVM on top a 64bit machine where more than 2GB of memory can be addressed.

In that case, the documentation tells us that Integer.MAX_INT would be returned... And that's a big problem, because any loop that used an integer variable relying on i < col.size() to stop would run forever (although I think that anything that loops 2**31-1 times would take long enough to make you want to kill the process anyway).

fortran