The Situation:
I'm optimizing a pure-java implementation of the LZF compression algorithm, which involves a lot of byte[] access and basic int mathematics for hashing and comparison. Performance really matters, because the goal of the compression is to reduce I/O requirements. I am not posting code because it isn't cleaned up yet, and may be restructured heavily.
The Questions:
- How can I write my code to allow it to JIT-compile to a form using faster SSE operations?
- How can I structure it so the compiler can easily eliminate array bounds checks?
- Are there any broad references about the relative speed of specific math operations (how many increments/decrements does it take to equal a normal add/subtract, how fast is shift-or vs. an array access)?
- How can I work on optimizing branching -- is it better to have numerous conditional statements with short bodies, or a few long ones, or short ones with nested conditions?
- With current 1.6 JVM, how many elements must be copied before System.arraycopy beats a copying loop?
What I've already done:
Before I get attacked for premature optimization: the basic algorithm is already excellent, but the Java implementation is less than 2/3 the speed of equivalent C. I've already replaced copying loops with System.arraycopy, worked on optimizing loops and eliminated un-needed operations.
I make heavy use of bit twiddling and packing bytes into ints for performance, as well as shifting & masking.
For legal reasons, I can't look at implementations in similar libraries, and existing libraries have too restrictive license terms to use.
Requirements for a GOOD (accepted) answer:
- Unacceptable answers: "this is faster" without an explanation of how much AND why, OR hasn't been tested with a JIT compiler.
- Borderline answers: have not been tested with anything before Hotspot 1.4
- Basic answers: will provide a general rule and explanation of why it is faster at the compiler level, and roughly how much faster
- Good answers: include a couple of samples of code to demonstrate
- Excellent answers: have benchmarks with both JRE 1.5 and 1.6
- PERFECT answer: Is by someone who worked on the HotSpot compiler, and can fully explain or reference the conditions for an optimization to be used, and how much faster it typically is. Might include java code and sample assembly code generated by HotSpot.
Also: if anyone has links detailing the guts of Hotspot optimization and branching performance, those are welcome. I know enough about bytecode that a site analyzing performance at a bytecode rather than sourcecode level would be helpful.
(Edit) Partial Answer: Bounds-Check Ellimination:
This is taken from supplied link to the HotSpot internals wiki at: http://wikis.sun.com/display/HotSpotInternals/RangeCheckElimination
HotSpot will eliminate bounds checks in all for loops with the following conditions:
- Array is loop invariant (not reallocated within the loop)
- Index variable has a constant stride (increases/decreases by constant amount, in only one spot if possible)
- Array is indexed by a linear function of the variable.
Example: int val = array[index*2 + 5]
OR: int val = array[index+9
NOT: int val = array[Math.min(var,index)+7]
Early version of code:
This is a sample version. Do not steal it, because it is an unreleased version of code for the H2 database project. The final version will be open source. This is an optimization upon the code here: H2 CompressLZF code
Logically, this is identical to the development version, but that one uses a for(...) loop to step through input, and an if/else loop for different logic between literal and backreference modes. It reduces array access and checks between modes.
public int compressNewer(final byte[] in, final int inLen, final byte[] out, int outPos){
int inPos = 0;
// initialize the hash table
if (cachedHashTable == null) {
cachedHashTable = new int[HASH_SIZE];
} else {
System.arraycopy(EMPTY, 0, cachedHashTable, 0, HASH_SIZE);
}
int[] hashTab = cachedHashTable;
// number of literals in current run
int literals = 0;
int future = first(in, inPos);
final int endPos = inLen-4;
// Loop through data until all of it has been compressed
while (inPos < endPos) {
future = (future << 8) | in[inPos+2] & 255;
// hash = next(hash,in,inPos);
int off = hash(future);
// ref = possible index of matching group in data
int ref = hashTab[off];
hashTab[off] = inPos;
off = inPos - ref - 1; //dropped for speed
// has match if bytes at ref match bytes in future, etc
// note: using ref++ rather than ref+1, ref+2, etc is about 15% faster
boolean hasMatch = (ref > 0 && off <= MAX_OFF && (in[ref++] == (byte) (future >> 16) && in[ref++] == (byte)(future >> 8) && in[ref] == (byte)future));
ref -=2; // ...EVEN when I have to recover it
// write out literals, if max literals reached, OR has a match
if ((hasMatch && literals != 0) || (literals == MAX_LITERAL)) {
out[outPos++] = (byte) (literals - 1);
System.arraycopy(in, inPos - literals, out, outPos, literals);
outPos += literals;
literals = 0;
}
//literal copying split because this improved performance by 5%
if (hasMatch) { // grow match as much as possible
int maxLen = inLen - inPos - 2;
maxLen = maxLen > MAX_REF ? MAX_REF : maxLen;
int len = 3;
// grow match length as possible...
while (len < maxLen && in[ref + len] == in[inPos + len]) {
len++;
}
len -= 2;
// short matches write length to first byte, longer write to 2nd too
if (len < 7) {
out[outPos++] = (byte) ((off >> 8) + (len << 5));
} else {
out[outPos++] = (byte) ((off >> 8) + (7 << 5));
out[outPos++] = (byte) (len - 7);
}
out[outPos++] = (byte) off;
inPos += len;
//OPTIMIZATION: don't store hashtable entry for last byte of match and next byte
// rebuild neighborhood for hashing, but don't store location for this 3-byte group
// improves compress performance by ~10% or more, sacrificing ~2% compression...
future = ((in[inPos+1] & 255) << 16) | ((in[inPos + 2] & 255) << 8) | (in[inPos + 3] & 255);
inPos += 2;
} else { //grow literals
literals++;
inPos++;
}
}
// write out remaining literals
literals += inLen-inPos;
inPos = inLen-literals;
if(literals >= MAX_LITERAL){
out[outPos++] = (byte)(MAX_LITERAL-1);
System.arraycopy(in, inPos, out, outPos, MAX_LITERAL);
outPos += MAX_LITERAL;
inPos += MAX_LITERAL;
literals -= MAX_LITERAL;
}
if (literals != 0) {
out[outPos++] = (byte) (literals - 1);
System.arraycopy(in, inPos, out, outPos, literals);
outPos += literals;
}
return outPos;
}
Final edit:
I've marked the best answer so far as accepted, since the deadline is nearly up. Since I took so long before deciding to post code, I will continue to upvote and respond to comments where possible. Apologies if the code is messy: this represented code in development, not polished up for committing.