I have a (Java) application that runs in a low-latency environment; it typically processes instructions in ~600micros (+/- 100). Naturally, as we've moved further into the microsecond space, the things you see costing latency change, and right now we've noticed that 2/3 of that time is spent in the allocation of the 2 core domain objects.

Benchmarking has isolated the offending sections of code to literally the construction of the objects from existing references, i.e. basically a load of references (~15 in each class) and a couple of lists new'ed, though see the note below on exactly what is measured here.

Each one consistently takes ~100micros, which is inexplicable to me, and I'm trying to track down why. A quick benchmark suggests a similarly sized object full of strings takes about 2-3micros to new up; obviously this sort of benchmark is fraught with difficulty, but I thought it might be useful as a baseline (a sketch of what I mean follows).
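
A minimal sketch of the sort of naive baseline benchmark I mean; Holder is a made-up stand-in for a similarly sized object, and the usual micro-benchmark caveats (dead-code elimination, JIT warm-up) all apply:

public class AllocBaseline {
    // ~15 reference fields, roughly the shape of the domain objects
    static final class Holder {
        final String f1, f2, f3, f4, f5, f6, f7, f8,
                     f9, f10, f11, f12, f13, f14, f15;
        Holder(String s) {
            f1 = s; f2 = s; f3 = s; f4 = s; f5 = s; f6 = s; f7 = s; f8 = s;
            f9 = s; f10 = s; f11 = s; f12 = s; f13 = s; f14 = s; f15 = s;
        }
    }

    public static void main(String[] args) {
        String s = "payload";
        Holder sink = null;
        int runs = 1000000;
        // warm-up pass so the constructor is JIT-compiled before we time it
        for (int i = 0; i < runs; i++) sink = new Holder(s);
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) sink = new Holder(s);
        System.out.println("avg construction: " + (System.nanoTime() - start) / runs
                + " ns (" + (sink != null) + ")");
    }
}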

There are 2 questions here:

  • how does one investigate this sort of behaviour?
  • what explanations are there for slow allocation?

Note: the hardware involved is Solaris 10 x86 on Sun X4600s with 8 dual-core Opterons @ 3.2GHz.

Things we've looked at include:

  • checking the PrintTLAB stats: they show very few slow allocations, so there should be no contention there (the diagnostic flags we're using are sketched after this list)
  • PrintCompilation suggests one of these bits of code is not JIT-friendly, though Solaris seems to have some unusual behaviour here (vis-à-vis a modern Linux; I don't have a Linux of similar vintage to Solaris 10 to bench on right now)
  • LogCompilation... a bit harder to parse, to say the least, so that is an ongoing job; nothing obvious so far
  • JVM versions... behaviour is consistent across 6u6 and 6u14; haven't tried 6u18 or latest 7 yet
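
For reference, the flags in question, roughly as we pass them (this vintage of HotSpot wants the diagnostic options unlocked for LogCompilation; MyApp is a placeholder for the real main class):

java -XX:+PrintTLAB \
     -XX:+PrintCompilation \
     -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation -XX:LogFile=hotspot.log \
     MyApp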

Any & all thoughts appreciated

A summary of the comments on assorted posts, to try and make things clearer:

  • the cost I'm measuring is the total cost of creating the object, which is built via a Builder (like one of these) and whose private constructor invokes new ArrayList a few times as well as setting references to existing objects. The cost measured covers the cost of setting up the builder and the conversion of the builder to a domain object (a sketch of this shape follows the list)
  • compilation (by hotspot) has a marked impact but it's still relatively slow (compilation in this case takes it from 100micros down to ~60)
  • compilation (by hotspot) on my naive benchmark takes allocation time down from ~2micros to ~300ns
  • latency does not vary with young gen collection algo (ParNew or Parallel scavenge)
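
To make the shape concrete, a hypothetical sketch (class and field names are made up; the real classes hold ~15 references each):

import java.util.ArrayList;
import java.util.List;

public final class Domain {
    private final Object ref1, ref2, ref3; // ...~15 such references in the real class
    private final List<Object> listA;
    private final List<Object> listB;

    private Domain(Builder b) {
        // just copying existing references...
        ref1 = b.ref1; ref2 = b.ref2; ref3 = b.ref3;
        // ...plus the couple of 'new ArrayList' calls mentioned above
        listA = new ArrayList<Object>();
        listB = new ArrayList<Object>();
    }

    public static final class Builder {
        private Object ref1, ref2, ref3;
        public Builder ref1(Object o) { this.ref1 = o; return this; }
        public Builder ref2(Object o) { this.ref2 = o; return this; }
        public Builder ref3(Object o) { this.ref3 = o; return this; }
        // the measured cost runs from builder setup through this conversion
        public Domain build() { return new Domain(this); }
    }
}
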
+1  A: 

Memory allocation can cause side-effects. Is it possible that the memory allocation is causing the heap to get compacted? Have you looked to see if your memory allocation is causing the GC to run at the same time?

Have you separately timed how long it takes to create the new ArrayLists?
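
Something as simple as this would give a rough number for that (illustrative only; the usual micro-benchmark caveats apply):

import java.util.ArrayList;
import java.util.List;

public class ListAllocTiming {
    public static void main(String[] args) {
        List<Object> sink = null;
        int runs = 1000000;
        // warm-up so the loop is JIT-compiled before we time it
        for (int i = 0; i < runs; i++) sink = new ArrayList<Object>();
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) sink = new ArrayList<Object>();
        System.out.println("avg new ArrayList: " + (System.nanoTime() - start) / runs
                + " ns (" + (sink != null) + ")");
    }
}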

RickNZ
+1  A: 

There is probably no hope of getting microsecond-latency guarantees from a general-purpose VM running on a general-purpose OS, even with such great hardware. Massive throughput is the best you can hope for. How about switching to a real-time VM if you need one? (I'm talking RTSJ and all that...)

...my two cents

Gyom
+2  A: 

Since your question was more about how to go about investigating the problem rather than 'what is my problem', I'll stick with some tools to try out.

A very useful tool for getting a better idea of what is going on, and when, is BTrace. It is similar to DTrace but is a pure Java tool. On that note, I'm assuming you know DTrace; if not, it is also useful, if somewhat obtuse. Between them these will give you some visibility into what is happening, and when, in the JVM and the OS.
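
As a starting point, a minimal BTrace script might look something like this (the target class name com.example.Domain is hypothetical; point it at your actual domain object):

import com.sun.btrace.annotations.*;
import static com.sun.btrace.BTraceUtils.*;

@BTrace
public class AllocProbe {
    // fires on return from the (private) constructor; @Duration receives
    // the nanoseconds spent inside it
    @OnMethod(clazz = "com.example.Domain", method = "<init>",
              location = @Location(Kind.RETURN))
    public static void onConstructed(@Duration long ns) {
        println(strcat("Domain ctor ns: ", str(ns)));
    }
}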

Oh, another thing to clarify from your original posting: what collector are you running? Given the latency concerns, I'm assuming you are using a low-pause collector like CMS. If so, have you tried any tuning?
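
If it is CMS, the usual era-appropriate starting points look something like this (values are illustrative, not recommendations; MyApp is a placeholder):

java -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
     -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly \
     -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
     MyApp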

reccles
+1  A: 

Just some wild guesses:

It's my understanding that Java VMs handle the memory of short-lived objects differently from long-lived objects. It would seem reasonable to me that the point where an object goes from having one function-local reference to having references in the global heap would be a big event. Instead of being available for cleanup at function exit, it now has to be tracked by the GC.

Or it could be that going from one reference to multiple references to a single object changes the GC accounting. As long as an object has a single reference it is easy to clean up; multiple references can form reference loops, and/or the GC may have to search for the reference in all other objects.

Zan Lynx
Matt: regarding reference counting: there is reference counting, and there is reference counting... A GC that knows the difference between 1 and Many references can make things much easier on itself. When cleaning objects that hold the one and only reference to another object, that object can also be immediately collected.
Zan Lynx
right but the cost (or saving) in that case would be at collect time as opposed to allocation time
Matt
+1  A: 

When you are repeating the same task many times, your CPU tends to run very efficiently. This is because cache misses and CPU warm-up cease to appear as a factor. It's also possible you are not considering your JVM warm-up time either.

If you try the same thing when the JVM and/or the CPU is not warmed up, you will get very different results.

Try doing the same thing, say, 25 times (less than your compile threshold) with a sleep(100) between tests. You should expect to see much higher times, closer to what you see in the real application.

The behaviour of your app will differ, but to illustrate my point: I have found that waiting for IO can be more disruptive than a plain sleep.

When you are performing your benchmark, you should try to ensure you are comparing like with like.

import java.io.*;
import java.util.Date;

/**
Cold JVM with a Hot CPU took 123 us average
Cold JVM with a Cold CPU took 403 us average
Cold JVM with a Hot CPU took 314 us average
Cold JVM with a Cold CPU took 510 us average
Cold JVM with a Hot CPU took 316 us average
Cold JVM with a Cold CPU took 514 us average
Cold JVM with a Hot CPU took 315 us average
Cold JVM with a Cold CPU took 545 us average
Cold JVM with a Hot CPU took 321 us average
Cold JVM with a Cold CPU took 542 us average
Hot JVM with a Hot CPU took 44 us average
Hot JVM with a Cold CPU took 111 us average
Hot JVM with a Hot CPU took 32 us average
Hot JVM with a Cold CPU took 96 us average
Hot JVM with a Hot CPU took 26 us average
Hot JVM with a Cold CPU took 80 us average
Hot JVM with a Hot CPU took 26 us average
Hot JVM with a Cold CPU took 90 us average
Hot JVM with a Hot CPU took 25 us average
Hot JVM with a Cold CPU took 98 us average
 */
public class HotColdBenchmark {
    public static void main(String... args) {
        // load all the classes.
        performTest(null, 25, false);
        for (int i = 0; i < 5; i++) {
            // still pretty cold
            performTest("Cold JVM with a Hot CPU", 25, false);
            // still pretty cold
            performTest("Cold JVM with a Cold CPU", 25, true);
        }

        // warmup the JVM
        performTest(null, 10000, false);
        for (int i = 0; i < 5; i++) {
            // warmed up.
            performTest("Hot JVM with a Hot CPU", 25, false);
            // bit cold
            performTest("Hot JVM with a Cold CPU", 25, true);
        }
    }

    public static long performTest(String report, int n, boolean sleep) {
        long time = 0;
        long ret = 0;
        for (int i = 0; i < n; i++) {
            long start = System.nanoTime();
            try {
                ByteArrayOutputStream baos = new ByteArrayOutputStream();
                ObjectOutputStream oos = new ObjectOutputStream(baos);
                oos.writeObject(new Date());
                oos.close();
                ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(baos.toByteArray()));
                Date d = (Date) ois.readObject();
                ret += d.getTime();
                time += System.nanoTime() - start;
                if (sleep) Thread.sleep(100);
            } catch (Exception e) {
                throw new AssertionError(e);
            }
        }
        if (report != null) {
            System.out.printf("%s took %,d us average%n", report, time / n / 1000);
        }
        return ret;
    }
}
Peter Lawrey
completely agree, but I'm confident this is not the issue in this case as I've done long benchmarks (>24hrs) and repeated benchmarks. That means n × 30 runs of 20 mins each (much greater than compilation time), reporting on the aggregated results, i.e. reporting on performance every n seconds and then summarising the results per period (which means an equivalent stage in the run) with min/max/avg and so on. The results are consistent once compilation has kicked in. Also, this is an application benchmark, not a synthetic one, hence the load is representative.
Matt
Is your CPU hot or cold for the timed period? I.e. does it block or sleep before the timed period?
Peter Lawrey
So if your CPU is typically cold, then you need to benchmark how your application runs from a cold start to get comparable results. I have found latency increases 2-5x in this case. BTW, don't assume there is nothing you can do about it. ;)
Peter Lawrey
the results are repeatable across many runs when the app is subjected to normal load. It never calls Thread.sleep (except at startup); in general the processing threads will either be active or waiting for work (aka Unsafe.park via a LinkedTransferQueue.take). I may well be missing your point, but your example says to me "do your best to force yourself onto the cpu and never yield, to keep things spinning". I don't see how this is viable in a real app unless you have the whole box to a single process, sadly not a luxury I have!
Matt
also worth bearing in mind that the particular bit of code I'm concerned about is in the middle of some other processing, hence the thread is actively on the cpu (given that we don't overload the box, it's rare to have an active thread bumped) and is doing a variety of (related) work, as opposed to just the specific bit I'm talking about here
Matt
You have to decide whether it is worth dedicating a thread/CPU to a task by busy waiting or not. If you dedicate the CPU, you will get better latency timings; if you don't, the CPU can be used to perform other tasks (from what you say, this isn't really required). If latency is a real concern, you shouldn't be passing work around between threads in any case.
Peter Lawrey
it's not passing units of work across threads but I do have to balance latency against throughput so tying up a thread like that isn't really feasible.
Matt