views: 69

answers: 1
I am doing some optimizations on an MPEG decoder. To ensure my optimizations aren't breaking anything, I have a test suite that benchmarks the entire codebase (both optimized and original) and verifies that both produce identical results (basically just feeding a couple of different streams through the decoder and computing a CRC32 of the outputs).
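
A minimal sketch of what such a verification harness can look like (the Decoder interface and its decode method are hypothetical stand-ins for the actual codebase):

    import java.io.ByteArrayInputStream;
    import java.util.zip.CRC32;

    public class DecoderVerification {

        // Hypothetical stand-in for the real decoder API.
        interface Decoder {
            byte[] decode(ByteArrayInputStream in); // returns PCM audio
        }

        // Feeds the same stream through both decoders and compares CRC32 checksums.
        static boolean outputsMatch(Decoder original, Decoder optimized, byte[] stream) {
            return crc32(original.decode(new ByteArrayInputStream(stream)))
                == crc32(optimized.decode(new ByteArrayInputStream(stream)));
        }

        static long crc32(byte[] pcm) {
            CRC32 crc = new CRC32();
            crc.update(pcm);
            return crc.getValue();
        }
    }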

When using the "-server" option with Sun JDK 1.6.0_18, the test suite runs about 12% slower on the optimized version after warmup (compared to the default "-client" setting), while the original codebase gets a good boost, running about twice as fast as in client mode.

While at first this seemed to be simply a warmup issue, I added a loop to repeat the entire test suite multiple times. Execution times then become constant for each pass starting at the 3rd iteration, yet the optimized version stays 12% slower than in client mode.

I am also pretty sure it's not a garbage collection issue, since the code involves absolutely no object allocations after startup. The code consists mainly of bit manipulation operations (stream decoding) and lots of basic floating-point math (generating PCM audio). The only JDK classes involved are ByteArrayInputStream (which feeds the stream to the test and excludes disk IO from the measurements) and CRC32 (to verify the result). I also observed the same behaviour with Sun JDK 1.7.0_b98 (only that it's 15% instead of 12% there). Oh, and the tests were all done on the same machine (single core) with no other applications running (WinXP).

While there is some inevitable variation in the measured execution times (using System.nanoTime, btw), the variation between different test runs with the same settings never exceeded 2%, usually less than 1% (after warmup), so I conclude the effect is real and not purely induced by the measuring mechanism/machine.

Are there any known coding patterns that perform worse on the server JIT? Failing that, what options are available to "peek" under the hood and observe what the JIT is doing there?

  • Maybe I misworded my "warmup" description. There is no explicit warmup code. The entire test suite (consisting of 12 different MPEG streams, containing ~180K audio frames total) is executed 10 times, and I regard the first 3 runs as "warmup" (see the sketch after this list). One test round takes approximately 40 seconds at 100% CPU on my machine.

  • I played with the JVM options as suggested, and using "-Xms512m -Xmx512m -Xss128k -server -XX:CompileThreshold=1 -XX:+PrintCompilation -XX:+AggressiveOpts -XX:+PrintGC" I could verify that all compilation takes place in the first 3 rounds. Garbage collection kicks in every 3-4 rounds and took 40ms at most (512m is extremely oversized, since the tests run fine with 16m). From this I conclude that garbage collection has no impact here. Still, comparing client to server (other options unaltered), the 12/15% difference remains.
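
For reference, the driver structure described above amounts to something like the following sketch (runTestSuite is a placeholder for decoding all 12 streams and verifying their CRCs):

    public class TestDriver {

        static final int TOTAL_ROUNDS  = 10; // full passes over the suite
        static final int WARMUP_ROUNDS = 3;  // discarded; compilation happens here

        public static void main(String[] args) {
            for (int round = 1; round <= TOTAL_ROUNDS; round++) {
                long start = System.nanoTime();
                runTestSuite(); // placeholder: 12 streams, ~180K frames total
                long millis = (System.nanoTime() - start) / 1000000L;
                System.out.println("round " + round
                        + (round <= WARMUP_ROUNDS ? " (warmup): " : ": ")
                        + millis + " ms");
            }
        }

        static void runTestSuite() { /* decode and CRC-verify all streams */ }
    }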

+4  A: 

As you've seen, the JIT can skew test results, since it runs in a background thread, stealing CPU cycles from the main thread running your test.

As well as stealing cycles, the compiler is also asynchronous, so you cannot be sure it has finished its work when you complete warmup and start your test for real. You can use the nonstandard -Xbatch option to force JIT compilation onto the foreground thread, so you can be sure the JIT has finished by the time your warmup completes.
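
For example, assuming the suite's entry point is a class named TestMain (a placeholder), running "java -Xbatch -server TestMain" makes each thread wait for compilation to complete when a method crosses the threshold, at the cost of a slower warmup.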

HotSpot doesn't compile methods right away, but waits until a method has been executed a certain number of times. On the page for the -XX options, it states that the default for -server is 10000 invocations, while for -client it is 1500. This could be a cause of the slowdown, particularly if your warmup ends up invoking many critical methods between 1500 and 10000 times: with -client they will be JITed during the warmup phase, but with -server compilation may be delayed into your measured runs, so part of your profiled code executes interpreted.
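
As a contrived illustration (the method and the call count are invented, not taken from your code), a small method invoked ~5000 times during warmup falls exactly into that gap:

    public class ThresholdGap {

        // Stand-in for per-frame work; any small hot method behaves the same way.
        static int decodeFrame(int bits) {
            return Integer.bitCount(bits * 0x9E3779B9);
        }

        public static void main(String[] args) {
            long sum = 0;
            // 5000 calls: past the -client threshold (1500),
            // but short of the -server threshold (10000).
            for (int i = 0; i < 5000; i++) {
                sum += decodeFrame(i);
            }
            System.out.println(sum); // run with -XX:+PrintCompilation to watch
        }
    }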

You can change the number of method invocations needed before HotSpot compiles a method by setting -XX:CompileThreshold. I chose twenty so that even vaguely hot spots (luke-warm spots?) are compiled during the warmup, even when the test is run just a few times. This has worked for me in the past, but YMMV, and different values may give you better results.
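
A concrete invocation would then look like "java -server -XX:CompileThreshold=20 TestMain" (TestMain again being a placeholder for whatever launches your suite).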

You might also check the HotSpot VM Options page to find the other defaults that differ between -client and -server, particularly the garbage collector settings, as these differ considerably.


mdma
Well, since I asked what could possibly affect the performance, this is a reasonable answer, so I'll accept it. Still, I'm pretty sure that the described warmup/GC issues are not what's causing my specific measurements. Any further hints would still be appreciated.
Durandal
You say the original version runs twice as fast with -server as with -client, yet the optimized code is 12% slower with -server than with -client. I can't give you much advice unless you detail specifically the optimizations you've applied. I recommend profiling: see where the code is slower under -server than under -client, then post a new question listing the changes - you'll then hopefully get a more specific answer. Good luck!
mdma