A: 

There are many reasons why this could happen. Without code and/or more details about the data, we can only guess what is happening.

Some Guesses:

  • Maybe you hit the maximum number of bytes that can be read at a time, so IO waits get higher or memory consumption goes up without a decrease in loops.
  • Maybe you hit a critical memory limit, or the JVM is trying to free memory before a new allocation. Try playing around with the `-Xmx` and `-Xms` parameters.
  • Maybe HotSpot can't/won't optimize, because the number of calls to some methods is too low.
  • Maybe there are OS or hardware conditions that cause this kind of delay.
  • Maybe the implementation of the JVM is just buggy ;-)
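
For what it's worth, since no code was posted we can only sketch the kind of read loop presumably being timed. The file name and 256KB buffer size below are guesses, not values from the original benchmark:

```java
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class BufferReadSketch {
    // Hypothetical test file and buffer size; the real benchmark's values aren't shown.
    private static final String FILE = "testdata.bin";
    private static final int BUF_SIZE = 256 * 1024;

    public static void main(String[] args) throws Exception {
        System.out.println("direct:     " + time(ByteBuffer.allocateDirect(BUF_SIZE)) + " ms");
        System.out.println("non-direct: " + time(ByteBuffer.allocate(BUF_SIZE)) + " ms");
    }

    private static long time(ByteBuffer buf) throws Exception {
        long start = System.nanoTime();
        try (FileChannel ch = new RandomAccessFile(FILE, "r").getChannel()) {
            while (ch.read(buf) != -1) {
                buf.clear(); // throw the data away; only the read throughput matters here
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

Running something like this once per buffer size, for both direct and non-direct buffers, should reproduce the kind of curves being discussed.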
Hardcoded
Hehe... Many of these I've speculated on myself, but none really make *total* sense to me. *"Max bytes?"* 256KB is not much, and it behaves differently for direct and non-direct buffers. *"256KB and the JVM memory settings?"* Again, 256KB is small, and the discrepancy is fairly consistent no matter how many loops it runs through. *"No HotSpot optimizations?"* I've tried different configurations and the results are still consistent. *"OS/HW conditions?"* Like what? And why different for direct versus non-direct buffers? Sigh...
Stu Thompson
The JVM may use different OS calls for direct and non-direct buffers, resulting in different runtime behaviour. Non-direct buffers may be slightly larger than direct ones. But the TLAB stuff from Bert sounds more like the source of your problem.
Hardcoded
It's not a *"problem"*. Merely an unexpected benchmark result that I would like to accurately understand.
Stu Thompson
BTW: After the above TLAB changes didn't work, I tried `-Xmx` and `-Xms` ...no joy :( The mystery remains.
Stu Thompson
+4  A: 

Thread Local Allocation Buffers (TLAB)

I wonder if the thread local allocation buffer (TLAB) size during the test is around 256K. TLABs optimize allocations from the heap so that non-direct allocations of <=256K are fast.

What is commonly done is to give each thread a buffer that is used exclusively by that thread to do allocations. You have to use some synchronization to allocate the buffer from the heap, but after that the thread can allocate from the buffer without synchronization. In the hotspot JVM we refer to these as thread local allocation buffers (TLAB's). They work well.
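
To make that concrete, here is a minimal sketch (not from the article) of the allocation pattern TLABs are designed to make cheap: many small per-thread allocations, each essentially a pointer bump inside the thread's own buffer:

```java
public class TlabAllocLoop {
    // Two threads each allocate many small objects. With TLABs, each thread
    // bumps a pointer inside its own buffer and only needs synchronization
    // when it has to fetch a fresh TLAB from the heap.
    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> {
            byte[] last = null;
            for (int i = 0; i < 10_000_000; i++) {
                last = new byte[64]; // small allocation, normally served from the TLAB
            }
            System.out.println(Thread.currentThread().getName() + " done (" + last.length + ")");
        };
        Thread a = new Thread(work, "alloc-1");
        Thread b = new Thread(work, "alloc-2");
        a.start();
        b.start();
        a.join();
        b.join();
    }
}
```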

Large allocations bypassing the TLAB

If my hypothesis about a 256K TLAB is correct, then information later in the article suggests that perhaps the >256K allocations for the larger non-direct buffers bypass the TLAB. These allocations go straight to the heap, requiring thread synchronization and thus incurring the performance hits.

An allocation that can not be made from a TLAB does not always mean that the thread has to get a new TLAB. Depending on the size of the allocation and the unused space remaining in the TLAB, the VM could decide to just do the allocation from the heap. That allocation from the heap would require synchronization but so would getting a new TLAB. If the allocation was considered large (some significant fraction of the current TLAB size), the allocation would always be done out of the heap. This cut down on wastage and gracefully handled the much-larger-than-average allocation.
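
In terms of the benchmark, a non-direct ByteBuffer.allocate(n) is just a wrapper around a heap byte[n], so under this hypothesis the two allocation paths would differ roughly as sketched below (the 256KB threshold is the guess being tested, not a documented constant):

```java
import java.nio.ByteBuffer;

public class TlabHypothesis {
    public static void main(String[] args) {
        // A non-direct buffer is backed by an ordinary byte[] on the Java heap.
        ByteBuffer small = ByteBuffer.allocate(256 * 1024);  // might still fit the fast TLAB path
        ByteBuffer large = ByteBuffer.allocate(1024 * 1024); // "large": might go straight to the shared heap
        // Direct buffers are allocated outside the Java heap entirely, so the
        // TLAB question only applies to the non-direct case.
        ByteBuffer direct = ByteBuffer.allocateDirect(1024 * 1024);
        System.out.println(small.capacity() + " " + large.capacity() + " " + direct.capacity());
    }
}
```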

Tweaking the TLAB parameters

This hypothesis could be tested using information from a later article indicating how to tweak the TLAB and get diagnostic info:

To experiment with a specific TLAB size, two -XX flags need to be set, one to define the initial size, and one to disable the resizing:

-XX:TLABSize= -XX:-ResizeTLAB

The minimum size of a TLAB is set with -XX:MinTLABSize, which defaults to 2K bytes. The maximum size is the maximum size of an integer Java array, which is used to fill the unallocated portion of a TLAB when a GC scavenge occurs.

Diagnostic Printing Options

-XX:+PrintTLAB

Prints at each scavenge one line for each thread (starting with "TLAB: gc thread: ", without the quotes) and one summary line.
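
Putting those flags together, a test run might look something like this (`MyBenchmark` is a stand-in for the actual benchmark class, and `1m` is just an example size):

java -XX:TLABSize=1m -XX:-ResizeTLAB -XX:+PrintTLAB MyBenchmark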

Bert F
+1 Wow. Thanks. I've never even heard of this stuff. Will experiment and report back.
Stu Thompson
Alas, no joy. :( I tried with values both larger (10MB) and smaller (2KB) and there was no change in the performance curves. But thanks for the educational trip into JVM options.
Stu Thompson
Awww - darn. I guess that's why hypotheses need experiments to confirm them. Thanks for checking it out and reporting back. As you say, even a wrong hypothesis can be educational and useful. I learned a lot just confirming my understanding of TLABs and writing up the answer.
Bert F
+1  A: 

I suspect that these knees are due to tripping across a CPU cache boundary. The "non-direct" buffer read()/write() implementation starts incurring cache misses earlier because of the additional memory buffer copy, compared to the "direct" buffer read()/write() implementation.
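
One way to test this, independent of any file I/O, would be to time plain in-memory copies at increasing buffer sizes and see whether a knee shows up near the same point. This is just a rough sketch; the sizes and iteration count are arbitrary:

```java
import java.nio.ByteBuffer;

public class CopyKneeSketch {
    public static void main(String[] args) {
        // Time bulk copies between two heap buffers at increasing sizes. If the
        // knee in the original benchmark is a CPU-cache effect, a similar knee
        // should appear here even though no I/O is involved.
        for (int size = 64 * 1024; size <= 8 * 1024 * 1024; size *= 2) {
            ByteBuffer src = ByteBuffer.allocate(size);
            ByteBuffer dst = ByteBuffer.allocate(size);
            long start = System.nanoTime();
            for (int i = 0; i < 100; i++) {
                src.clear();
                dst.clear();
                dst.put(src); // bulk copy of 'size' bytes
            }
            double seconds = (System.nanoTime() - start) / 1_000_000_000.0;
            double mbCopied = 100.0 * size / (1024.0 * 1024.0);
            System.out.printf("%5d KB: %.0f MB/s%n", size / 1024, mbCopied / seconds);
        }
    }
}
```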

Harv
Hmm. Maybe. But firstly, my MBP Core 2 Duo has a 4MB L2 cache. And secondly, I would not expect the direct byte buffer to have the data going across the CPU--it should all be handled via DMA. Not sure how to test for this idea. Hmmmmm....
Stu Thompson
I applied Zach Smith's memory bandwidth "benchmark" (http://home.comcast.net/~fbui/bandwidth.html) on my MBP Core Duo that likewise has a 4MB L2 cache. The tool shows a knee at 1MB. The direct byte buffer does not enable DMA. The direct byte buffer allocates process memory (i.e., malloc()) in the JVM. The JVM file system read()/write() are copying memory to/from system memory into the direct buffer's process memory.
Harv
FWIW, my MBP actually only has a 3MB L2 cache (not 4MB as I previously stated).
Harv