ansaurus

Question

How can I code Java to allow SSE use and bounds-check elimination (or other advanced optimizations)?

Answer 1

+2 A:

http://wikis.sun.com/display/HotSpotInternals/PerformanceTechniques

Not a GOOD answer, but hopefully helps.

bkail 2009-08-29 21:49:17

I actually didn't know about that -- my understanding of this has been pieced together from about a million different places, plus a lot of trial and error.

BobMcGee 2009-08-29 22:03:06

Answer 2

+1 A:

I guess you already know these, but just in case: check http://wikis.sun.com/display/HotSpotInternals and it's related pages, especially: http://wikis.sun.com/display/HotSpotInternals/PerformanceTechniques

Simon Groenewolt 2009-08-29 21:53:16

It gives some answers, but are there any other sites that provide more specific knowledge on pieces of this?

BobMcGee 2009-09-04 03:59:54

Answer 3

+6 A:

ShuggyCoUk 2009-09-11 22:00:14

Good answer (best so far)! Do you know any sites that provide info about the cycles required for specific assembly commands? Or advice about how bytecode maps to x86 assembly? I don't know that much about assembly.VTune is a little beyond my budget right now (heh), although I'll keep it in mind for the future. I'm working on a version with more predictable array access, and less array fetches. There is zero object allocation, but the branches in code are more complex. Are constant additions (var2 = var1+2, for example) likely to be optimized heavily?

BobMcGee 2009-09-14 15:46:16

cycles per instruction are no longer a meaningful metirc since it is heavily dependent on the interaction with other instructions and dependencies, cache availability, and OOE along with pipeline depth.The jit mapping is *way* more complex that "X byte code => Y x86".sequential array access is good but the branchiness may need to provide > linear speed up for this to pay off significantly.Any compie time constants are folded by the compiler.

ShuggyCoUk 2009-09-14 16:32:25

Addition of a variable constant may not make a significant difference except the compiler may be able to effectively share this value in a register if it is used often. Free profilers are available for java. You should at least look at some of them, http://code.google.com/p/oktech-profiler/ is open source, free and has a sampling mode.

ShuggyCoUk 2009-09-14 16:33:57

I figured the bytecode mapping might be complex -- trying to simplify dependencies here where possible, but not sure how well it will work. Good to know that the additions do not pose problems -- they're just constants offsets relative to changing variables. I've done profiling, but it can't help further -- roughly 99% of runtime lives in the compression loop(s) and System.arraycopy to copy literal runs to output.Can you give any more info about how to approach optimizing for the cache and maximal pipeline depth, from the end of java source?

BobMcGee 2009-09-14 19:14:45

optimizing for cache sizes is simplest buy taking any and all buffers used and tweaking their sizes in relation to the size and associativity of your test machines and investigating the effects. (it's more complex but this can give you a start). pipeline usage is likely to be entirely dependent on the jit unless you can alter the algorithm to do a sort of vectorisation whereby you take something like A1B1C1A2B2C2. where C depends on B which depends on A and do A1A2B1B2C1C2 insteadarraycopy doesn't sound great, why do you have to work in scratch space, can you not work in the output buffer?

ShuggyCoUk 2009-09-14 22:33:42

So, I can't do anything besides optimize buffers to encourage cache use? The only buffers involved are input and output buffers. Bytes are processed in sequence from input, and either used to increase the length of an incompressible run of literals, or a backreference to previous bytes (compression). Once either run has to end, it is written out (using arraycopy for literal runs) to the output buffer.----------------------------------I think this can be vectorized by packing to int/long, but this would make the branching extremely complex and add un-used array accesses.

BobMcGee 2009-09-14 23:05:50

I've posted an *early* version of the code (not the most recent one, which uses a FOR loop to step through the input buffer predictably. That one is still in debugging.

BobMcGee 2009-09-14 23:29:34

The arraycopy for literal runs is likely not something you can beat unless the common literal length is small (I would think at least less than 16bytes, perhaps less). Encouraging good cache use is about making the data you access close, It is likely in your case that the main issue is how you structure your 'look back' strutures based on how they are interogatted

ShuggyCoUk 2009-09-15 11:03:09

Yeah, the arraycopy is actually pretty optimal in modern JVMs. The latest HotSpot releases use hand-tuned assembly here, and are about 2x as fast copying 32 items. I don't think I can optimize look-back structures much, without adding copying to a cache-able small buffer (additional copy cost), or something. I'll see with the restructured loop.

BobMcGee 2009-09-15 16:00:01

Answer 4

+4 A:

As far as bounds check elimination is concerned, I believe the new JDK will already include an improved algorithm that eliminates it, whenever it's possible. These are the two main papers on this subject:

V. Mikheev, S. Fedoseev, V. Sukharev, N. Lipsky. 2002 Effective Enhancement of Loop Versioning in Java. Link. This paper is from the guys at Excelsior, who implemented the technique in their Jet JVM.
Würthinger, Thomas, Christian Wimmer, and Hanspeter Mössenböck. 2007. Array Bounds Check Elimination for the Java HotSpot Client Compiler. PPPJ. Link. Slightly based on the above paper, this is the implementation that I believe will be included in the next JDK. The achieved speedups are also presented.

There is also this blog entry, which discusses one of the papers superficially, and also presents some benchmarking results, not only for arrays but also for arithmetic in the new JDK. The comments of the blog entry are also very interesting, since the authors of the above papers present some very interesting comments and discuss arguments. Also, there are some pointers to other similar blog posts on this subject.

Hope it helps.

JG 2009-09-14 15:35:50

This **IS** very interesting, and yet another reason JDK/JRE 1.7 will be nice, if it ever gets released. Between this, the asynchronous I/O APIs (halfway between stream I/O and NIO), and the new concurrency APIs, it should bring java performance much closer to optimized C. Here, have an upvote for posting something useful (even if not an answer).

BobMcGee 2009-09-14 19:51:45

Answer 5

+2 A:

It's rather unlikely that you need to help the JIT compiler much with optimizing a straightforward number crunching algorithm like LZW. ShuggyCoUk mentioned this, but I think it deserves extra attention:

The cache-friendliness of your code will be a big factor.

You have to reduce the size of your woking set and improve data access locality as much as possible. You mention "packing bytes into ints for performance". This sounds like using ints to hold byte values in order to have them word-aligned. Don't do that! The increased data set size will outweigh any gains (I once converted some ECC number crunching code from int[] to byte[] and got a 2x speed-up).

On the off chance that you don't know this: if you need to treat some data as both bytes and ints, you don't have to shift and |-mask it - use ByteBuffer.asIntBuffer() and related methods.

With current 1.6 JVM, how many elements must be copied before System.arraycopy beats a copying loop?

Better do the benchmark yourself. When I did it way back when in Java 1.3 times, it was somewhere around 2000 elements.

Michael Borgwardt 2009-09-14 20:10:52

This is LZF, not LZW... so speed is essential.I've edited the original post to include the most recent stable version of the compression code, and a link to the earliest version (less optimized). Byte-to-int packing is used for the look-ahead, which is used in the hashtable of positions for 3-byte groups, and for checking if a candidate backref is usable.

BobMcGee 2009-09-14 23:27:22

Also: I am absolutely positive the figure is less than 32 elements now, because it as about 2x as fast using arraycopy vs. loop with this many elements. Then again, the loop wasn't fully optimal.

BobMcGee 2009-09-15 04:44:19

I believe the jit was changed a while back to allow the array copy to be special cased. I can't pull up a reference anywhere apart from http://www.ibm.com/developerworks/java/library/j-devrtj2.html which isn't clear when it changed. Of course this is per JVM as well

ShuggyCoUk 2009-09-15 11:09:37

ah yes - here's some more discussion on this: http://www.mail-archive.com/[email protected]/msg10172.html

ShuggyCoUk 2009-09-15 11:10:22

Answer 6

+1 A:

Lots of answers so far, but couple of additional things:

Measure, measure, measure. As much as most Java developers warn against micro-benchmarking, make sure you compare performance between changes. Optimizations that do not result in measurable improvements are generally not worth keeping (of course, sometimes it's combination of things, and that gets trickier)
Tight loops matter as much with Java as with C (and ditto with variable allocations -- although you don't directly control it, HotSpot will eventually have to do it). I manage to practically double the speed of UTF-8 decoding by rearranging tight loop for handling single-byte case (7-bit ascii) as tight(er) inner loop, leaving other cases out.
Do not underestimate cost of allocating and/or clearing large arrays -- if you want lzf encoding/decoding to be faster for small/medium chunks too (not just megabyte sized), keep in mind that ALL allocations of byte[]/int[] are somewhat costly; not because of GC, but because JVM MUST clear the space.

H2 implementation has also been optimized quite a bit (for example: it does not clear the hash array any more, this often makes sense); and I actually helped modify it for use in another Java project. My contribution was mostly just changing it do be more optimal for non-streaming case, but that did not touch the tight encode/decode loops.

StaxMan 2009-12-08 05:38:50

I'm well aware of the recent H2 optimizations, and am partially responsible: several of the optimizations are actually based on a draft version of my code I sent around the time I posted this. This includes the re-use of the "hash" variable when checking for a possible backref, the cleaner loop structure, and not initializing the hashtable. Out of curiosity, what were your modifications? As a teaser: look for a couple higher-compression options and faster decompression in the H2 code when stuff gets out of code review.

BobMcGee 2009-12-08 06:01:00

ansaurus

tags:

views:

answers:

How can I code Java to allow SSE use and bounds-check elimination (or other advanced optimizations)?

The Situation:

The Questions:

What I've already done:

Requirements for a GOOD (accepted) answer:

(Edit) Partial Answer: Bounds-Check Ellimination:

Early version of code:

Final edit:

related questions