I have a (Java) application that runs in a low-latency environment; it typically processes instructions in ~600micros (+/- 100). Naturally, as we've moved further into the microsecond space, the things that cost latency have changed, and right now we've noticed that 2/3 of that time is spent in the allocation of the 2 core domain objects.
Benchmarking has isolated the offending sections of code to literally the construction of the objects from existing references, i.e. basically setting a load of references (~15 in each class) and new'ing a couple of lists, though see the note below on exactly what is measured here.
Each one consistently takes ~100micros, which is inexplicable to me, and I'm trying to track down why. A quick benchmark suggests a similarly sized object full of strings takes about 2-3micros to new up; obviously this sort of benchmark is fraught with difficulty, but I thought it might be useful as a baseline.
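For what it's worth, the naive baseline benchmark was along these lines (a rough sketch only; field count, names and iteration counts are illustrative, and the real numbers come from the application's own measurement harness):

```java
// Rough baseline: time new'ing a similarly sized object full of Strings.
// Illustrative sketch, not the production harness.
public class AllocBaseline {

    static class Dummy {
        final String a, b, c, d, e, f, g, h, i, j, k, l, m, n, o;
        final java.util.List<String> list1 = new java.util.ArrayList<String>();
        final java.util.List<String> list2 = new java.util.ArrayList<String>();

        Dummy(String s) {
            a = s; b = s; c = s; d = s; e = s; f = s; g = s; h = s;
            i = s; j = s; k = s; l = s; m = s; n = s; o = s;
        }
    }

    public static void main(String[] args) {
        String s = "x";
        long blackhole = 0;

        // warm up so HotSpot has compiled the constructor before measuring
        for (int i = 0; i < 200000; i++) {
            blackhole += new Dummy(s).hashCode();
        }

        int iterations = 200000;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            blackhole += new Dummy(s).hashCode();
        }
        long elapsed = System.nanoTime() - start;

        System.out.println("avg ns per allocation: " + (elapsed / iterations)
                + " (ignore: " + blackhole + ")");
    }
}
```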
There are two questions here:
- how does one investigate this sort of behaviour?
- what explanations are there for slow allocation?
Note the hardware involved is Solaris 10 x86 on Sun X4600s with 8 dual-core Opterons @ 3.2GHz.
Things we've looked at include:
- checking the PrintTLAB stats (flags shown after this list); they show very few slow allocs, so there should be no contention there.
- PrintCompilation suggests one of these bits of code is not JIT-friendly, though Solaris seems to have some unusual behaviour here (vis-à-vis a modern Linux; I don't have a Linux of similar vintage to Solaris 10 to bench on right now).
- LogCompilation... a bit harder to parse, to say the least, so that's an ongoing job; nothing obvious so far.
- JVM versions... behaviour is consistent across 6u6 and 6u14; haven't tried 6u18 or the latest 7 yet.
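For reference, the investigation steps above boil down to running with the following HotSpot flags (the exact output format varies somewhat across JVM versions):

```
java -XX:+PrintTLAB ...                                        # TLAB refill/slow-allocation stats
java -XX:+PrintCompilation ...                                 # JIT compilation events on stdout
java -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation ...    # detailed XML compilation log for offline parsing
```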
Any & all thoughts appreciated
A summary of the comments on assorted posts, to try to make things clearer:
- the cost I'm measuring is the total cost of creating the object, which is built via a Builder (like one of these) and whose private constructor invokes new ArrayList a few times as well as setting references to existing objects; see the simplified sketch after this list. The cost measured covers setting up the builder and the conversion of the builder into the domain object.
- compilation (by HotSpot) has a marked impact, but it's still relatively slow (compilation in this case takes it from ~100micros down to ~60).
- compilation (by HotSpot) on my naive benchmark takes allocation time down from ~2micros to ~300ns.
- latency does not vary with the young gen collection algo (ParNew or Parallel Scavenge).
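To make the "Builder + private constructor" point concrete, the construction being measured is shaped roughly like this (names are invented and the field list is trimmed down from the real ~15 references; this illustrates the pattern, not the actual domain class):

```java
import java.util.ArrayList;
import java.util.List;

// Placeholder types standing in for existing domain objects.
final class Instrument {}
final class Party {}

public final class Order {

    private final Instrument instrument;   // existing reference, just copied
    private final Party counterparty;      // existing reference, just copied
    // ... ~15 such reference fields in the real class
    private final List<String> fills;
    private final List<String> notes;

    private Order(Builder b) {
        this.instrument = b.instrument;
        this.counterparty = b.counterparty;
        this.fills = new ArrayList<String>();   // a couple of lists new'ed here
        this.notes = new ArrayList<String>();
    }

    public static final class Builder {
        private Instrument instrument;
        private Party counterparty;

        public Builder instrument(Instrument i) { this.instrument = i; return this; }
        public Builder counterparty(Party p)    { this.counterparty = p; return this; }

        // The measured cost covers populating the Builder plus this conversion.
        public Order build() { return new Order(this); }
    }
}
```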