I am doing some optimizations on an MPEG decoder. To ensure my optimizations aren't breaking anything, I have a test suite that benchmarks the entire codebase (both the optimized and the original version) and verifies that both produce identical results (basically just feeding a couple of different streams through the decoder and CRC32-ing the outputs).
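For illustration, a minimal sketch of what one such verification pass could look like; MpegDecoder and decode() are hypothetical placeholders for my actual decoder API, only ByteArrayInputStream and CRC32 are the real JDK classes I use:

    import java.io.ByteArrayInputStream;
    import java.io.InputStream;
    import java.util.zip.CRC32;

    public class DecoderVerification {

        // Hypothetical stand-in for the real decoder interface.
        interface MpegDecoder {
            byte[] decode(InputStream in) throws Exception;
        }

        // Decodes one in-memory stream and returns a CRC32 of the PCM output;
        // ByteArrayInputStream keeps disk I/O out of the measurement.
        static long checksumOf(MpegDecoder decoder, byte[] stream) throws Exception {
            CRC32 crc = new CRC32();
            crc.update(decoder.decode(new ByteArrayInputStream(stream)));
            return crc.getValue();
        }

        // The optimized decoder must produce bit-identical output to the original.
        static void verify(MpegDecoder original, MpegDecoder optimized, byte[] stream) throws Exception {
            if (checksumOf(original, stream) != checksumOf(optimized, stream)) {
                throw new AssertionError("optimized decoder output differs from original");
            }
        }
    }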
When using the "-server" option with Sun JDK 1.6.0_18, the test suite runs about 12% slower on the optimized version after warmup (compared to the default "-client" setting), while the original codebase gets a good boost, running about twice as fast as in client mode.
At first this seemed to me to be simply a warmup issue, so I added a loop to repeat the entire test suite multiple times. Execution times then become constant for each pass from the 3rd iteration of the test onwards, yet the optimized version still stays 12% slower than in client mode.
I am also pretty sure it's not a garbage collection issue, since the code involves absolutely no object allocations after startup. The code consists mainly of bit manipulation operations (stream decoding) and lots of basic floating-point math (generating PCM audio). The only JDK classes involved are ByteArrayInputStream (which feeds the stream to the test and keeps disk I/O out of the tests) and CRC32 (to verify the result). I also observed the same behaviour with Sun JDK 1.7.0_b98 (only there it's 15% instead of 12%). Oh, and the tests were all done on the same machine (single core) with no other applications running (WinXP). While there is some inevitable variation in the measured execution times (using System.nanoTime, btw), the variation between different test runs with the same settings never exceeded 2%, and was usually less than 1% (after warmup), so I conclude the effect is real and not purely induced by the measuring mechanism/machine.
Are there any known coding patterns that perform worse on the server JIT? Failing that, what options are available to "peek" under the hood and observe what the JIT is doing there?
Maybe I misworded my "warmup" description. There is no explicit warmup code. The entire test suite (consisting of 12 different MPEG streams, containing ~180K audio frames in total) is executed 10 times, and I regard the first 3 runs as "warmup". One test round takes approximately 40 seconds of 100% CPU on my machine.
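Structurally, the measurement loop looks roughly like the sketch below; runTestSuite() is a hypothetical placeholder for decoding and CRC32-verifying all 12 streams once, the timing via System.nanoTime is as described above:

    public class Benchmark {

        // Hypothetical placeholder: decodes and CRC32-verifies all 12 test streams once.
        static void runTestSuite() throws Exception { /* ... */ }

        public static void main(String[] args) throws Exception {
            final int rounds = 10, warmupRounds = 3;
            long[] elapsed = new long[rounds];

            for (int i = 0; i < rounds; i++) {
                long start = System.nanoTime();
                runTestSuite();
                elapsed[i] = System.nanoTime() - start;
            }

            // Only the rounds after the warmup phase are compared between -client and -server.
            for (int i = warmupRounds; i < rounds; i++) {
                System.out.printf("round %d: %.1f s%n", i + 1, elapsed[i] / 1e9);
            }
        }
    }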
I played with the JVM options as suggested, and using "-Xms512m -Xmx512m -Xss128k -server -XX:CompileThreshold=1 -XX:+PrintCompilation -XX:+AggressiveOpts -XX:+PrintGC" I could verify that all compilation takes place in the first 3 rounds. Garbage collection kicks in every 3-4 rounds and took 40ms at most (512m is extremely oversized, since the tests run just fine with 16m). From this I conclude that garbage collection has no impact here. Still, comparing client to server (other options unaltered), the 12/15% difference remains.
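For reference, the full invocation looked roughly like this (the main class name Benchmark is just a placeholder for my test driver):

    java -Xms512m -Xmx512m -Xss128k -server -XX:CompileThreshold=1 -XX:+PrintCompilation -XX:+AggressiveOpts -XX:+PrintGC Benchmark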