views:

732

answers:

10

How big does a buffer need to be in Java before it's worth reusing?

Or, put another way: I can repeatedly allocate, use, and discard byte[] objects, OR run a pool to keep and reuse them. I might allocate a lot of small buffers that get discarded often, or a few big ones that don't. At what size is it cheaper to pool them than to reallocate, and how do small allocations compare to big ones?

EDIT:

Ok, specific parameters. Say an Intel Core 2 Duo CPU, latest VM version for OS of choice. This question isn't as vague as it sounds... a little code and a graph could answer it.

EDIT2:

You've posted a lot of good general rules and discussions, but the question really asks for numbers. Post 'em (and code too)! Theory is great, but the proof is in the numbers. It doesn't matter if results vary some from system to system; I'm just looking for a rough estimate (order of magnitude). Nobody seems to know whether the performance difference will be a factor of 1.1, 2, 10, or 100+, and this is something that matters. It is important for any Java code working with big arrays -- networking, bioinformatics, etc.

Suggestions to get a good benchmark:

  1. Warm up the code before benchmarking it. Methods should all be called at least 10,000 times to get full JIT optimization.
  2. Make sure benchmarked methods run for at least 10 seconds, and use System.nanoTime if possible, to get accurate timings.
  3. Run the benchmark on a system that is only running minimal applications.
  4. Run the benchmark 3-5 times and report all times, so we see how consistent it is.
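The suggestions above could be packaged as a small harness, roughly like this. This is only a sketch; the names `AllocBench`, `Benchmark`, and `runBenchmark` are illustrative, not from any existing library, and the absolute numbers it prints will vary by machine and JVM:

```java
import java.util.concurrent.TimeUnit;

public class AllocBench {

    interface Benchmark { int run(); }

    // Warms up, then runs the benchmark until at least minNanos have
    // elapsed; returns the average nanoseconds per call.
    static double runBenchmark(Benchmark b, long minNanos) {
        int sink = 0;
        for (int i = 0; i < 10_000; i++) {        // suggestion 1: warm up for JIT
            sink += b.run();
        }
        long start = System.nanoTime();            // suggestion 2: nanoTime
        long calls = 0;
        long elapsed;
        do {
            sink += b.run();
            calls++;
            elapsed = System.nanoTime() - start;
        } while (elapsed < minNanos);              // suggestion 2: run long enough
        if (sink == 42) System.out.println(sink);  // defeat dead-code elimination
        return (double) elapsed / calls;
    }

    public static void main(String[] args) {
        final int size = 32 * 1024;
        // suggestion 4: repeat the run several times and report every result
        for (int run = 0; run < 5; run++) {
            double ns = runBenchmark(() -> {
                byte[] buf = new byte[size];       // allocate and discard
                return buf[size - 1];
            }, TimeUnit.SECONDS.toNanos(10));
            System.out.printf("run %d: ~%.1f ns per %d-byte allocation%n", run, ns, size);
        }
    }
}
```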

I know this is a vague and somewhat demanding question. I will check this question regularly, and answers will get comments and rated up consistently. Lazy answers will not (see below for criteria). If I don't have any answers that are thorough, I'll attach a bounty. I might anyway, to reward a really good answer with a little extra.

What I know (and don't need repeated):

  • Java memory allocation and GC are fast and getting faster.
  • Object pooling used to be a good optimization, but now it hurts performance most of the time.
  • Object pooling is "not usually a good idea unless objects are expensive to create." Yadda yadda.

What I DON'T know:

  • How fast should I expect memory allocations to run (MB/s) on a standard modern CPU?
  • How does allocation size affect allocation rate?
  • What's the break-even point for number/size of allocations vs. re-use in a pool?

Routes to an ACCEPTED answer (the more the better):

  • A recent whitepaper showing figures for allocation & GC on modern CPUs (recent as in last year or so, JVM 1.6 or later)
  • Code for a concise and correct micro-benchmark I can run
  • Explanation of how and why the allocations impact performance
  • Real-world examples/anecdotes from testing this kind of optimization

The Context:

I'm working on a library adding LZF compression support to Java. This library extends the H2 DBMS LZF classes, by adding additional compression levels (more compression) and compatibility with the byte streams from the C LZF library. One of the things I'm thinking about is whether or not it's worth trying to reuse the fixed-size buffers used to compress/decompress streams. The buffers may be ~8 kB, or ~32 kB, and in the original version they're ~128 kB. Buffers may be allocated one or more times per stream. I'm trying to figure out how I want to handle buffers to get the best performance, with an eye toward potentially multithreading in the future.

Yes, the library WILL be released as open source if anyone is interested in using this.

+6  A: 

When it is larger than young space.

If your array is larger than the thread-local young space, it is directly allocated in the old space. Garbage collection on the old space is way slower than on the young space. So if your array is larger than the young space, it might make sense to reuse it.

On my machine, 32 kB exceeds the young space. So it would make sense to reuse it.

Adrian
What's the performance diff like when it goes from young space to old?
BobMcGee
Depends on your memory settings, in general young space is way smaller than old space, so GC on young space is obviously faster. You can use `-XX:+PrintGCDetails` to check for your benchmark.
Adrian
I would suggest that it makes much more sense to increase the size of the heap / young space.
Stephen C
@stephen true enough; given that the OP is writing a library, though, the library should adapt to whatever settings it finds predefined, at least IMHO.
Adrian
@adrian - different JVMs have different strategies for "large object" allocation. By talking about thread-local space, it seems that you might be talking about JRockit? For Sun, the young/tenured split is configured, and may be much smaller than the eden size.
kdgregory
You are right, different VMs use different strategies. Personally as OP I would just not reuse the arrays at all. In my post I referred to Sun's JVM. Hotspot uses thread-local space in eden (otherwise it could not just increment a pointer when allocating objects).
Adrian
@Adrian: Theoretically there *is* some point at which pooling is advantageous... presumably you don't intend to reallocate multi-megabyte arrays repeatedly for fast operations. We could be talking 2 kB arrays, or 128 kB arrays, or multi-MB (or even GB) arrays.
BobMcGee
@Adrian (again): again, how big is the performance difference going from old to young space? I'm not asking for a precise figure, but an order of magnitude would be nice.... 4x as fast? 10x as fast? 100x as fast?
BobMcGee
@BobMcGee, I don't know the numbers. I would have to measure myself using `-XX:+PrintGCDetails`. In general, GC on the young space is faster but happens more often, whereas GC on the old space is rare and takes longer. So it might really depend on your use case. The best starting point to learn more is http://java.sun.com/javase/technologies/hotspot/gc/
Adrian
+2  A: 

You've neglected to mention anything about thread safety. If it's going to be reused by multiple threads you'll have to worry about synchronization.

duffymo
I know this. It's another question though, and one like this is enough... the cost of allocation will decide if I pass back and forth small buffers between threads, or reuse them in a queue (or synchronize on one larger array).
BobMcGee
+1 ... the cost of contention can be enormous, potentially far larger than the cost of allocation.
kdgregory
The cost of contention for a buffer pool should be outweighed by the complexity of processing performed on the data (it's fairly memory-bandwidth intensive, and has a couple significant branches which cannot be accurately predicted b/c they depend on data).
BobMcGee
+17  A: 

If you want a simple answer, it is that there is no simple answer. No amount of calling answers (and by implication people) "lazy" is going to help.

How fast should I expect memory allocations to run (MB/s) on a standard modern CPU?

At the speed at which the JVM can zero memory, assuming that the allocation does not trigger a garbage collection. If it does trigger garbage collection, it is impossible to predict without knowing what GC algorithm is used, the heap size and other parameters, and an analysis of the application's working set of non-garbage objects over the lifetime of the app.

How does allocation size affect allocation rate?

See above.

What's the break-even point for number/size of allocations vs. re-use in a pool?

If you want a simple answer, it is that there is no simple answer.

The golden rule is, the bigger your heap is (up to the amount of physical memory available), the smaller the amortized cost of GC'ing a garbage object. With a fast copying garbage collector, the amortized cost of freeing a garbage object approaches zero as the heap gets larger. The cost of the GC is actually determined by (in simplistic terms) the number and size of non-garbage objects that the GC has to deal with.

Under the assumption that your heap is large, the lifecycle cost of allocating and GC'ing a large object (in one GC cycle) approaches the cost of zeroing the memory when the object is allocated.

EDIT: If all you want is some simple numbers, write a simple application that allocates and discards large buffers and run it on your machine with various GC and heap parameters and see what happens. But beware that this is not going to give you a realistic answer because real GC costs depend on an application's non-garbage objects.

I'm not going to write a benchmark for you because I know that it would give you bogus answers.

EDIT 2: In response to the OP's comments.

So, I should expect allocations to run about as fast as System.arraycopy, or a fully JITed array initialization loop (about 1GB/s on my last bench, but I'm dubious of the result)?

Theoretically yes. In practice, it is difficult to measure in a way that separates the allocation costs from the GC costs.

By heap size, are you saying allocating a larger amount of memory for JVM use will actually reduce performance?

No, I'm saying it will increase performance. Significantly. (Provided that you don't run into OS-level virtual memory effects.)

Allocations are just for arrays, and almost everything else in my code runs on the stack. It should simplify measuring and predicting performance.

Maybe. Frankly, I think that you are not going to get much improvement by recycling buffers.

But if you are intent on going down this path, create a buffer pool interface with two implementations. The first is a real thread-safe buffer pool that recycles buffers. The second is a dummy pool which simply allocates a new buffer each time alloc is called, and treats dispose as a no-op. Finally, allow the application developer to choose between the pool implementations via a setBufferPool method and/or constructor parameters and/or runtime configuration properties. The application should also be able to supply a buffer pool class / instance of its own making.
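A minimal sketch of that two-implementation approach might look like the following. The names (`BufferPool`, `RecyclingPool`, `AllocatingPool`, `acquire`/`release`) are purely illustrative, not from any real library:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative interface: the library codes against this, the
// application picks (or supplies) the implementation.
interface BufferPool {
    byte[] acquire();
    void release(byte[] buffer);
}

// Real pool: recycles fixed-size buffers, synchronized for thread safety.
class RecyclingPool implements BufferPool {
    private final int bufferSize;
    private final Deque<byte[]> free = new ArrayDeque<>();

    RecyclingPool(int bufferSize) { this.bufferSize = bufferSize; }

    public synchronized byte[] acquire() {
        byte[] b = free.pollFirst();
        return (b != null) ? b : new byte[bufferSize];
    }

    public synchronized void release(byte[] buffer) {
        if (buffer.length == bufferSize) free.addFirst(buffer);
    }
}

// Dummy pool: always allocates; release is a no-op, leaving the GC to clean up.
class AllocatingPool implements BufferPool {
    private final int bufferSize;
    AllocatingPool(int bufferSize) { this.bufferSize = bufferSize; }
    public byte[] acquire() { return new byte[bufferSize]; }
    public void release(byte[] buffer) { /* no-op: let the GC take it */ }
}
```

Benchmarking the two implementations against each other would then answer the pooling question directly for a given workload.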

Stephen C
So, I should expect allocations to run about as fast as System.arraycopy, or a fully JITed array initialization loop (about 1 GB/s on my last bench, but I'm dubious of the result)? By heap size, are you saying allocating a larger amount of memory for JVM use will actually *reduce* performance? If so, I should be benchmarking any optimizations at different memory sizes too... As far as allocations besides buffers, this is quite simple. Allocations are just for arrays, and almost everything else in my code runs on the stack. It should simplify measuring and predicting performance.
BobMcGee
One comment: in the Sun JVM, the size of an object determines whether it is allocated in the young generation (which uses a copying collector) or in the tenured generation (which compacts the heap after sweeping).
kdgregory
So, although this is quite informative, I'm still left with my original question: at what buffer size is it worth pooling buffers (say with a thread-safe pool using WeakReference objects for them to allow GC)? I'm not looking for a 100% exact answer, but more like an order of magnitude. Obviously not worthwhile for 1 kB, but what about 10 kB or 100 kB or even multiple megabytes?
BobMcGee
Is it really just a question of size? For example, suppose the amount of computation done with a given array is small. Then the cost of zeroing the array upon allocation may be significant, and avoiding this cost by reusing allocated arrays may be worthwhile.
GregS
Another comment: go ahead with the benchmark... even if the numbers aren't universally sound, they should give an order-of-magnitude estimate. I really want some figures to go with your excellent discussion of the theory.
BobMcGee
@BobMcGee - sorry, but I don't do other people's research for them for free.
Stephen C
+1  A: 

Keep in mind that cache effects will probably be more of an issue than the cost of "new int[size]" and its corresponding collection. Reusing buffers is therefore a good idea if you have good temporal locality. Reallocating the buffer instead of reusing it means you might get a different chunk of memory each time. As others mentioned, this is especially true when your buffers don't fit in the young generation.

If you allocate but then don't use the whole buffer, it also pays to reuse as you don't waste time zeroing out memory you never use.
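The zeroing cost mentioned above can be seen with a rough sketch like this one, which touches only one byte of a large array: the fresh-allocation version still pays to zero the whole thing. Class and method names are illustrative, and the absolute numbers are machine-dependent; only the ratio is interesting:

```java
public class ZeroCost {
    // Allocate a fresh (zero-filled) array each iteration, touch one byte.
    static long timeFresh(int size, int reps) {
        long start = System.nanoTime();
        long sink = 0;
        for (int i = 0; i < reps; i++) {
            byte[] b = new byte[size];   // JVM must zero the whole array
            b[0] = 1;                    // but we touch only one byte
            sink += b[0];
        }
        if (sink < 0) System.out.println(sink); // defeat dead-code elimination
        return System.nanoTime() - start;
    }

    // Reuse one array across all iterations: no repeated zeroing.
    static long timeReused(int size, int reps) {
        byte[] b = new byte[size];       // allocated (and zeroed) once
        long start = System.nanoTime();
        long sink = 0;
        for (int i = 0; i < reps; i++) {
            b[0] = 1;
            sink += b[0];
        }
        if (sink < 0) System.out.println(sink);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int size = 128 * 1024, reps = 10_000;
        timeFresh(size, reps);           // warm-up
        timeReused(size, reps);
        System.out.println("fresh:  " + timeFresh(size, reps) + " ns");
        System.out.println("reused: " + timeReused(size, reps) + " ns");
    }
}
```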

Keith Randall
A: 

More important than buffer size is number of allocated objects, and total memory allocated.

  1. Is memory usage an issue at all? If it is a small app, it may not be worth worrying about.

The real advantage of pooling is avoiding memory fragmentation. The overhead of allocating/freeing memory is small, but the disadvantage is that if you repeatedly allocate many objects of many different sizes, memory becomes more fragmented. Using a pool prevents fragmentation.

Larry Watanabe
Modern JVMs use garbage collectors that will compact the heap (via either copy-to-empty-space or explicit compaction).
kdgregory
Thanks, but in one sense fragmentation will still affect efficiency because there will be more frequent gc (because of lower utilization of space due to fragmentation). Still, an entirely different order of magnitude of problem than memory fragmentation in a non-memory managed system.
Larry Watanabe
+2  A: 

An answer from a completely different direction: let the user of your library decide.

Ultimately, however optimized you make your library, it will only be a component of a larger application. And if that larger application makes infrequent use of your library, there's no reason that it should pay to maintain a pool of buffers -- even if that pool is only a few hundred kilobytes.

So create your pooling mechanism as an interface, and based on some configuration parameter select the implementation that's used by your library. Set the default to be whatever your benchmark tests determine to be the best solution (1). And yes, if you use an interface you'll have to rely on the JVM being smart enough to inline calls (2).


(1) By "benchmark," I mean a long-running program that exercises your library outside of a profiler, passing it a variety of inputs. Profilers are extremely useful, but so is measuring the total throughput after an hour of wall-clock time. On several different computers with differing heap sizes, and several different JVMs, running in single and multi-threaded modes.

(2) This can get you into another line of debate about the relative performance of the various invoke opcodes.

kdgregory
I can add a toggle, but I don't really want users having to mess with a bunch of parameters. If I use a pool, it will be GC'd when no more instances are in use via compressor instances, so I don't really need to worry about it.
BobMcGee
Oh, and I already benchmark compression/decompression with LZF using a variety of both real and synthetic data.
BobMcGee
+1  A: 

I forgot that this is a managed-memory system.

Actually, you probably have the wrong mindset. The appropriate way to determine when it is useful is dependent on the application, system it is running on, and user usage pattern.

In other words - just profile the system, determine how much time is being spent in garbage collection as a percentage of total application time in a typical session, and see if it is worthwhile to optimize that.

You will probably find out that gc isn't even being called at all. So writing code to optimize this would be a complete waste of time.

With today's large memory spaces I suspect that 90% of the time it isn't worth doing at all. You can't really determine this based on parameters - it is too complex. Just profile - it's easy and accurate.

Larry Watanabe
+1  A: 

Short answer: Don't pool.

Reasons follow:

  • Don't optimize it yet; wait until it becomes a bottleneck.
  • If you recycle buffers, the overhead of managing the pool will become another bottleneck.
  • Try to trust the JIT. In the latest JVMs, your array may be allocated on the STACK rather than the HEAP.
  • Trust me, the JRE usually handles this faster and better than you can DIY.
  • Keep it simple, for easier reading and debugging.

When you should recycle an object:

  • Only if it is heavy. Memory size alone won't make it heavy, but native resources and CPU cycles do, which cost additional finalization and CPU cycles.
  • You may want to recycle them if they are "ByteBuffer" rather than byte[].
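The ByteBuffer point is worth illustrating: direct buffers live outside the Java heap and are comparatively expensive to allocate and reclaim, so reusing one via `clear()`/`flip()` pays off much sooner than pooling plain byte[]. A minimal sketch (the class name is illustrative):

```java
import java.nio.ByteBuffer;

public class DirectBufferReuse {
    public static void main(String[] args) {
        // Direct buffers are costly to allocate and reclaim, so one
        // buffer is reused across passes instead of reallocating.
        ByteBuffer buf = ByteBuffer.allocateDirect(32 * 1024);

        for (int pass = 0; pass < 3; pass++) {
            buf.clear();                  // reset position/limit for reuse
            while (buf.hasRemaining()) {
                buf.put((byte) pass);     // stand-in for real I/O work
            }
            buf.flip();                   // ready the same buffer for reading
            System.out.println("pass " + pass + ", readable bytes: " + buf.remaining());
        }
    }
}
```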
Dennis Cheung
The question is basically "what is heavy" when it comes to memory allocations. Memory carries a cost too, just as much as native resources and CPU-heavy processing do. Most of the advice you've offered is too general to be of assistance, or is common knowledge when it comes to Java performance. I think that's covered under "things that I already know." I've read much of the available Java optimization advice online, so it's not new.
BobMcGee
I've read the above answers and they have some points missing: scalability issues like (1) multi-core and (2) swapping of virtual memory. In the real world both will happen. Micro-benchmarks are evil. If you are looking for a white paper, IBM has one: http://www.ibm.com/developerworks/java/library/j-jtp09275.html?ca=dgr-jw22JavaUrbanLegends What I mean is that it is a clear topic: don't do it unless you have a reason. If you have your own pool, it might get swapped. If you pool your objects, you create locks and synchronized memory, which are heavy. You will spend your time doing something that is not likely to boost the performance.
Dennis Cheung
If a byte[] array pool is good for you, the JIT/JVM should do it for you. It is a reasonable request, and they already optimize object allocation with object templates in memory.
Dennis Cheung
A: 

I think the answer you need is related to the 'order' (measuring space, not time!) of the algorithm.

Copy file example

For example, if you want to copy a file you need to read from an input stream and write to an output stream. The TIME order is O(n) because the time will be proportional to the size of the file. But the SPACE order will be O(1) because the program you'll need to do it will occupy a fixed amount of memory (you'll need only one fixed buffer). In this case it's clear that it's convenient to reuse that very buffer you instantiated at the beginning of the program.

Relate the buffer policy with your algorithm execution structure

Of course, if your algorithm needs an endless supply of buffers and each buffer is a different size, you probably cannot reuse them. But it gives you some clues:

  • Try to fix the size of the buffers (even sacrificing a little bit of memory).
  • Try to see what the structure of the execution is: for example, if your algorithm traverses some kind of tree and your buffers are related to each node, maybe you only need O(log n) buffers... so you can make an educated guess of the space required.
  • Also, if you need different buffers but can arrange things to share different segments of the same array... maybe that's a better solution.
  • When you release a buffer you can add it to a pool of buffers. That pool can be a heap ordered by a "fitting" criterion (buffers that fit best should come first).

What I'm trying to say is: there's no fixed answer. If you instantiated something that you can reuse... it's probably better to reuse it. The tricky part is finding how to do it without incurring buffer-management overhead. Here's where algorithm analysis comes in handy.
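One way to sketch the "pool ordered by fitting criteria" idea is a TreeMap keyed by buffer size, where acquire() does a best-fit lookup with ceilingEntry. The class and method names here are hypothetical:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.TreeMap;

// Best-fit pool sketch: free buffers are keyed by size; acquire()
// returns the smallest pooled buffer that is at least minSize long.
class BestFitPool {
    private final TreeMap<Integer, Deque<byte[]>> bySize = new TreeMap<>();

    synchronized byte[] acquire(int minSize) {
        Map.Entry<Integer, Deque<byte[]>> e = bySize.ceilingEntry(minSize);
        if (e == null) return new byte[minSize];   // nothing big enough pooled
        byte[] buf = e.getValue().pollFirst();
        if (e.getValue().isEmpty()) bySize.remove(e.getKey());
        return buf;
    }

    synchronized void release(byte[] buffer) {
        bySize.computeIfAbsent(buffer.length, k -> new ArrayDeque<>())
              .addFirst(buffer);
    }
}
```

Whether the TreeMap bookkeeping is cheaper than simply reallocating is exactly the break-even question the OP is asking, so it would still need benchmarking.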

Hope it helps... :)

helios
+1  A: 

Looking at a micro-benchmark (code below) there is no appreciable difference in time on my machine regardless of the size and the number of times the array is used (I am not posting the times; you can easily run it on your machine :-). I suspect that this is because the garbage is alive for so short a time there is not much to do for cleanup. Array allocation is probably a call to calloc or malloc/memset. Depending on the CPU this will be a very fast operation. If the arrays survived for longer (enough to make it past the initial GC area, the nursery), then the version that allocates several arrays might take a bit longer.

code:

import java.util.Random;

public class Main
{
    public static void main(String[] args) 
    {
        final int size;
        final int times;

        size  = 1024 * 128;
        times = 100;

        // uncomment only one of the ones below for each run
        test(new NewTester(size), times);   
//        test(new ReuseTester(size), times); 
    }

    private static void test(final Tester tester, final int times)
    {
        final long total;

        // warmup
        testIt(tester, 1000);
        total = testIt(tester, times);

        System.out.println("took:   " + total);
    }

    private static long testIt(final Tester tester, final int times)
    {
        long total;

        total = 0;

        for(int i = 0; i < times; i++)
        {
            final long start;
            final long end;
            final int value;

            start = System.nanoTime();
            value = tester.run();
            end   = System.nanoTime();
            total += (end - start);

            // make sure the value is used so the VM cannot optimize too much
            System.out.println(value);
        }

        return (total);
    }
}

interface Tester
{
    int run();
}

abstract class AbstractTester
    implements Tester
{
    protected final Random random;

    {
        random = new Random(0);
    }

    public final int run()
    {
        int value;

        value = 0;

        // make sure the random number generator always has the same work to do
        random.setSeed(0);

        // make sure that we have something to return so the VM cannot optimize the code out of existence.
        value += doRun();

        return (value);
    }

    protected abstract int doRun();
}

class ReuseTester
    extends AbstractTester
{
    private final int[] array;

    ReuseTester(final int size)
    {
        array = new int[size];
    }

    public int doRun()
    {
        final int size;

        // make sure the lookup of the array.length happens once
        size = array.length;

        for(int i = 0; i < size; i++)
        {
            array[i] = random.nextInt();
        }

        return (array[size - 1]);
    }
}

class NewTester
    extends AbstractTester
{
    private int[] array;
    private final int length;

    NewTester(final int size)
    {
        length = size;
    }

    public int doRun()
    {
        final int   size;

        // make sure the lookup of the length happens once
        size = length;
        array = new int[size];

        for(int i = 0; i < size; i++)
        {
            array[i] = random.nextInt();
        }

        return (array[size - 1]);
    }
}
TofuBeer