I have a (Java) application that runs in a low-latency environment; it typically processes instructions in ~600micros (+/- 100). Naturally, as we've moved further into the microsecond space, the things you see costing latency change, and right now we've noticed that 2/3 of that time is spent in the allocation of the 2 core domain objects.

Benchmarking has isolated the offending sections of code to literally the construction of the objects from existing references, i.e. basically a load of references (~15 in each class) and a couple of lists new'ed, though see the note below on exactly what is measured here.

Each one consistently takes ~100micros, which is inexplicable to me, and I'm trying to track down why. A quick benchmark suggests a similarly sized object full of strings takes about 2-3micros to new up; obviously this sort of benchmark is fraught with difficulty, but I thought it might be useful as a baseline (a sketch of what I mean follows).
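
A minimal sketch of the sort of naive baseline benchmark I mean; Holder is a made-up stand-in for a similarly sized object, and the usual micro-benchmark caveats (dead-code elimination, JIT warm-up) all apply:

public class AllocBaseline {
    // ~15 reference fields, roughly the shape of the domain objects
    static final class Holder {
        final String f1, f2, f3, f4, f5, f6, f7, f8,
                     f9, f10, f11, f12, f13, f14, f15;
        Holder(String s) {
            f1 = s; f2 = s; f3 = s; f4 = s; f5 = s; f6 = s; f7 = s; f8 = s;
            f9 = s; f10 = s; f11 = s; f12 = s; f13 = s; f14 = s; f15 = s;
        }
    }

    public static void main(String[] args) {
        String s = "payload";
        Holder sink = null;
        int runs = 1000000;
        // warm-up pass so the constructor is JIT-compiled before we time it
        for (int i = 0; i < runs; i++) sink = new Holder(s);
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) sink = new Holder(s);
        System.out.println("avg construction: " + (System.nanoTime() - start) / runs
                + " ns (" + (sink != null) + ")");
    }
}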

There are 2 questions here:

  • how does one investigate this sort of behaviour?
  • what explanations are there for slow allocation?

Note: the hardware involved is Solaris 10 x86 on Sun X4600s with 8 dual-core Opterons @ 3.2GHz.

Things we've looked at include:

  • checking the PrintTLAB stats: they show very few slow allocations, so there should be no contention there (the diagnostic flags we're using are sketched after this list)
  • PrintCompilation suggests one of these bits of code is not JIT-friendly, though Solaris seems to have some unusual behaviour here (vis-à-vis a modern Linux; I don't have a Linux of similar vintage to Solaris 10 to bench on right now)
  • LogCompilation... a bit harder to parse, to say the least, so that is an ongoing job; nothing obvious so far
  • JVM versions... behaviour is consistent across 6u6 and 6u14; haven't tried 6u18 or latest 7 yet
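
For reference, the flags in question, roughly as we pass them (this vintage of HotSpot wants the diagnostic options unlocked for LogCompilation; MyApp is a placeholder for the real main class):

java -XX:+PrintTLAB \
     -XX:+PrintCompilation \
     -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation -XX:LogFile=hotspot.log \
     MyApp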

Any & all thoughts appreciated

A summary of the comments on assorted posts, to try and make things clearer:

  • the cost I'm measuring is the total cost of creating the object, which is built via a Builder (like one of these) and whose private constructor invokes new ArrayList a few times as well as setting references to existing objects. The cost measured covers the cost of setting up the builder and the conversion of the builder to a domain object (a sketch of this shape follows the list)
  • compilation (by hotspot) has a marked impact but it's still relatively slow (compilation in this case takes it from 100micros down to ~60)
  • compilation (by hotspot) on my naive benchmark takes allocation time down from ~2micros to ~300ns
  • latency does not vary with young gen collection algo (ParNew or Parallel scavenge)
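
To make the shape concrete, a hypothetical sketch (class and field names are made up; the real classes hold ~15 references each):

import java.util.ArrayList;
import java.util.List;

public final class Domain {
    private final Object ref1, ref2, ref3; // ...~15 such references in the real class
    private final List<Object> listA;
    private final List<Object> listB;

    private Domain(Builder b) {
        // just copying existing references...
        ref1 = b.ref1; ref2 = b.ref2; ref3 = b.ref3;
        // ...plus the couple of 'new ArrayList' calls mentioned above
        listA = new ArrayList<Object>();
        listB = new ArrayList<Object>();
    }

    public static final class Builder {
        private Object ref1, ref2, ref3;
        public Builder ref1(Object o) { this.ref1 = o; return this; }
        public Builder ref2(Object o) { this.ref2 = o; return this; }
        public Builder ref3(Object o) { this.ref3 = o; return this; }
        // the measured cost runs from builder setup through this conversion
        public Domain build() { return new Domain(this); }
    }
}
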
+1  A: 

Memory allocation can cause side-effects. Is it possible that the memory allocation is causing the heap to get compacted? Have you looked to see if your memory allocation is causing the GC to run at the same time?

Have you separately timed how long it takes to create the new ArrayLists?
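
Something as simple as this would give a rough number for that (illustrative only; the usual micro-benchmark caveats apply):

import java.util.ArrayList;
import java.util.List;

public class ListAllocTiming {
    public static void main(String[] args) {
        List<Object> sink = null;
        int runs = 1000000;
        // warm-up so the loop is JIT-compiled before we time it
        for (int i = 0; i < runs; i++) sink = new ArrayList<Object>();
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) sink = new ArrayList<Object>();
        System.out.println("avg new ArrayList: " + (System.nanoTime() - start) / runs
                + " ns (" + (sink != null) + ")");
    }
}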

RickNZ
+1  A: 

There is probably no hope of getting microsecond-latency guarantees from a general-purpose VM running on a general-purpose OS, even with such great hardware. Massive throughput is the best you can hope for. How about switching to a real-time VM if you need one? (I'm talking RTSJ and all that...)

...my two cents

Gyom
+2  A: 

Since your question was more about how to go about investigating the problem rather than 'what is my problem', I'll stick with some tools to try out.

A very useful tool for getting a better idea of what is going on, and when, is BTrace. It is similar to DTrace but is a pure Java tool. On that note, I'm assuming you know DTrace; if not, it is also useful, if somewhat obtuse. Between them these will give you some visibility into what is happening, and when, in the JVM and the OS.
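
As a starting point, a minimal BTrace script might look something like this (the target class name com.example.Domain is hypothetical; point it at your actual domain object):

import com.sun.btrace.annotations.*;
import static com.sun.btrace.BTraceUtils.*;

@BTrace
public class AllocProbe {
    // fires on return from the (private) constructor; @Duration receives
    // the nanoseconds spent inside it
    @OnMethod(clazz = "com.example.Domain", method = "<init>",
              location = @Location(Kind.RETURN))
    public static void onConstructed(@Duration long ns) {
        println(strcat("Domain ctor ns: ", str(ns)));
    }
}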

Oh, another thing to clarify from your original posting: what collector are you running? Given the latency concerns, I'm assuming you are using a low-pause collector like CMS. If so, have you tried any tuning?
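
If it is CMS, the usual era-appropriate starting points look something like this (values are illustrative, not recommendations; MyApp is a placeholder):

java -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
     -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly \
     -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
     MyApp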

reccles
+1  A: 

Just some wild guesses:

It's my understanding that Java VMs handle the memory of short-lived objects differently from long-lived objects. It would seem reasonable to me that the point where an object goes from having one function-local reference to having references in the global heap would be a big event. Instead of being available for cleanup at function exit, it now has to be tracked by the GC.

Or it could be that going from one reference to multiple references to a single object changes the GC accounting. As long as an object has a single reference it is easy to clean up; multiple references can form reference loops, and/or the GC may have to search for the reference in all other objects.

Zan Lynx
Matt: regarding reference counting: there is reference counting, and there is reference counting... A GC that knows the difference between 1 and Many references can make things much easier on itself. When cleaning objects that hold the one and only reference to another object, that object can also be immediately collected.
Zan Lynx
right but the cost (or saving) in that case would be at collect time as opposed to allocation time
Matt
+1  A: 

When you are repeating the same task many times, your CPU tends to run very efficiently. This is because cache misses and CPU warm-up cease to appear as a factor. It's also possible you are not considering your JVM warm-up time either.

If you try the same thing when the JVM and/or the CPU is not warmed up, you will get very different results.

Try doing the same thing, say, 25 times (less than your compile threshold) with a sleep(100) between tests. You should expect to see much higher times, closer to what you see in the real application.

The behaviour of your app will differ, but to illustrate my point: I have found that waiting for IO can be more disruptive than a plain sleep.

When you are performing your benchmark, you should try to ensure you are comparing like with like.

import java.io.*;
import java.util.Date;

/**
Cold JVM with a Hot CPU took 123 us average
Cold JVM with a Cold CPU took 403 us average
Cold JVM with a Hot CPU took 314 us average
Cold JVM with a Cold CPU took 510 us average
Cold JVM with a Hot CPU took 316 us average
Cold JVM with a Cold CPU took 514 us average
Cold JVM with a Hot CPU took 315 us average
Cold JVM with a Cold CPU took 545 us average
Cold JVM with a Hot CPU took 321 us average
Cold JVM with a Cold CPU took 542 us average
Hot JVM with a Hot CPU took 44 us average
Hot JVM with a Cold CPU took 111 us average
Hot JVM with a Hot CPU took 32 us average
Hot JVM with a Cold CPU took 96 us average
Hot JVM with a Hot CPU took 26 us average
Hot JVM with a Cold CPU took 80 us average
Hot JVM with a Hot CPU took 26 us average
Hot JVM with a Cold CPU took 90 us average
Hot JVM with a Hot CPU took 25 us average
Hot JVM with a Cold CPU took 98 us average
 */
public class HotColdBenchmark {
    public static void main(String... args) {
        // load all the classes.
        performTest(null, 25, false);
        for (int i = 0; i < 5; i++) {
            // still pretty cold
            performTest("Cold JVM with a Hot CPU", 25, false);
            // still pretty cold
            performTest("Cold JVM with a Cold CPU", 25, true);
        }

        // warmup the JVM
        performTest(null, 10000, false);
        for (int i = 0; i < 5; i++) {
            // warmed up.
            performTest("Hot JVM with a Hot CPU", 25, false);
            // bit cold
            performTest("Hot JVM with a Cold CPU", 25, true);
        }
    }

    public static long performTest(String report, int n, boolean sleep) {
        long time = 0;
        long ret = 0;
        for (int i = 0; i < n; i++) {
            long start = System.nanoTime();
            try {
                ByteArrayOutputStream baos = new ByteArrayOutputStream();
                ObjectOutputStream oos = new ObjectOutputStream(baos);
                oos.writeObject(new Date());
                oos.close();
                ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(baos.toByteArray()));
                Date d = (Date) ois.readObject();
                ret += d.getTime();
                time += System.nanoTime() - start;
                if (sleep) Thread.sleep(100);
            } catch (Exception e) {
                throw new AssertionError(e);
            }
        }
        if (report != null) {
            System.out.printf("%s took %,d us average%n", report, time / n / 1000);
        }
        return ret;
    }
}
Peter Lawrey
completely agree, but I'm confident this is not the issue in this case as I've done long benchmarks (>24hrs) and repeated benchmarks. That means n × 30 runs of 20 mins each (much greater than compilation time), reporting on the aggregated results, i.e. reporting on performance every n seconds and then summarising the results per period (which means an equivalent stage in the run) with min/max/avg and so on. The results are consistent once compilation has kicked in. Also, this is an application benchmark, not a synthetic one, hence the load is representative.
Matt
Is your CPU hot or cold for the timed period? I.e. does it block or sleep before the timed period?
Peter Lawrey
So if your CPU is typically cold, then you need to benchmark how your application runs from a cold start to get comparable results. I have found latency increases 2-5x in this case. BTW, don't assume there is nothing you can do about it. ;)
Peter Lawrey
the results are repeatable across many runs when the app is subjected to normal load. It never calls Thread.sleep (except at startup); in general the processing threads will either be active or waiting for work (aka Unsafe.park via a LinkedTransferQueue.take). I may well be missing your point, but your example says to me "do your best to force yourself onto the cpu and never yield, to keep things spinning". I don't see how this is viable in a real app unless you have the whole box to a single process, sadly not a luxury I have!
Matt
also worth bearing in mind that the particular bit of code I'm concerned about is in the middle of some other processing, hence the thread is actively on the cpu (given that we don't overload the box, it's rare to have an active thread bumped) and is doing a variety of (related) work, as opposed to just the specific bit I'm talking about here
Matt
You have to decide whether it is worth dedicating a thread/CPU to a task by busy waiting or not. If you dedicate the CPU, you will get better latency timings; if you don't, the CPU can be used to perform other tasks (from what you say, this isn't really required). If latency is a real concern, you shouldn't be passing work around between threads in any case.
Peter Lawrey
it's not passing units of work across threads but I do have to balance latency against throughput so tying up a thread like that isn't really feasible.
Matt