views:

146

answers:

6

I am decomposing a single long into a long[] of single bit longs by

public static int decompose(long[] buffer, long base) {
  int count = Long.bitCount(base);
  for (int i=0; i<count; i++) {
    base ^= (buffer[i] = Long.lowestOneBit(base));
  }
  return count;
}

but I feel like there might be a faster way to do this, since it seems like there are some repeated steps. For example, counting the bits should already come pretty close to getting all the info needed to populate the result.

Any suggestions? I'm familiar with the premature optimization mantra, hence why I'm not pushing my solution any further on my time, but perhaps someone else has seen this before or is addicted to optimization...EDIT: please run any suggestions through the test harness Mark provided below. I am more than a bit surprised that my first inkling is actually holding up.

test code, which is inside a JUnit method:

Random rng = new Random();
long size = 0;
long[] hold = new long[Long.SIZE];
System.out.println("init:"+Long.toBinaryString(BitTwiddling.bitmask(rng.nextInt(Long.SIZE)))); //initialize BitTwiddling internals
long start = System.currentTimeMillis();
for (int i=0; i<compareIterations; i++) size+= BitTwiddling.decompose(hold,rng.nextInt(Integer.MAX_VALUE));
long delta = System.currentTimeMillis() - start;
double base = (double)size/delta;

size = 0;
start = System.currentTimeMillis();
for (int i=0; i<compareIterations; i++) size += BitTwiddling.decomposeAlt(hold, rng.nextInt(Integer.MAX_VALUE));
long delta2 = System.currentTimeMillis() - start;
double base2 = (double)size/delta2;

then compare base and base2

+2  A: 

Since I love optimization*, here's a version you might try instead:

public static int decompose(long[] buffer, long base) {
    int count = 0;
    while (base != 0) {
        long mask = base & (-base);
        base &= ~mask;
        buffer[count++] = mask;
    }
    return count;
}

Main things I did were:

  • Inline the lowest one bit calculation to avoid a method call overhead. This can be a win in the (rare but possible) cases where the compiler/JVM isn't clever enough to do it for you....
  • Pass an array of longs as an input parameter to avoid a memory allocation and the need to count the bits to determine the array size. Function now returns the number of bitmasks found.
  • Zero the bits as you go along, so that you can bail out as early as possible from the loop

*because it's fun, regardless of whether it is premature or not

mikera
after some comparisons (had to make the original more apples-to-apples by eliminating the buffer allocation), the answer I have above performs within a percent of your answer. Eliminating the allocation improved speed by about a fifth, but that's only going to be useful when the buffer can be reused.
Carl
also, rolling the while loop up into a single statement (`while ((base ` seems to improve performance a bit (a few percent over a minutes worth of running the method). Perhaps because the `mask` assignment is eliminated?
Carl
+1  A: 

I'd use a mask, calculated at each loop with the shift operator:

for (int i= 0; i < result.length; i++)
    result[i]= base & (1<<i);

Should be both clear and fast.

Samuel_xL
that doesn't work for arbitrary bit locations within `base` (e.g. 100101101 has a `result` length of 5).
Carl
+2  A: 

Here's the harness I'm using at the moment. I consistently get the code in the question (Method 1) to perform significantly faster. Currently Mark Peters' answer is the fastest normally, but Carl's is faster with the -server flag set. Also, mikera's gets considerably more competitive in server mode (though still slower than Carl/Mark's).

import java.util.Random;

public abstract class Harness {
    public static void main(String[] args) {
        while (true) {
            long[] randomData = new long[4096];
            Random r = new Random();
            for (int i = 0; i < randomData.length; i++) {
                randomData[i] = r.nextLong();
            }
            long[] buffer = new long[64];

            Harness[] versions = new Harness[] {
                    new Control(randomData, buffer),
                    new Carl(randomData, buffer),
                    new MarkPeters(randomData, buffer),
                    new MarkPetersServer64Bit(randomData, buffer),
//                    new Rsp1(randomData, buffer), 
                    new Rsp2(randomData, buffer),
                    new Mikera(randomData, buffer)
                    };
            for (Harness v : versions) {
                v.doTest();
            }
        }
    }

    private final long[] buffer;
    private final long[] randomData;

    protected Harness(long[] randomData, long[] buffer) {
        this.randomData = randomData;
        this.buffer = buffer;
    }

    public void doTest() {
        long start = System.nanoTime();
        long count = 0;
        for (int times = 0; times < 1000; times++) {
            for (int i = 0; i < randomData.length; i++) {
                count = decompose(buffer, randomData[i]);
            }
        }
        long end = System.nanoTime();

        //verify decomposition of last item
        long l = 0;
        for ( int i = 0; i < count; i++ ) {
            l |= buffer[i];
        }

        System.out.println(getClass().getSimpleName() + " took " + (end - start)
                / 1000000.0 + " ms - last base: " + l);
    }

    protected abstract int decompose(long[] buffer, long base);

    private static class Control extends Harness {
        protected Control(long[] randomData, long[] buffer) {
            super(randomData, buffer);
        }

        @Override
        protected int decompose(long[] buffer, long base) {
            return 0;
        }
    }

    private static class Carl extends Harness {
        protected Carl(long[] randomData, long[] buffer) {
            super(randomData, buffer);
        }

        @Override
        protected int decompose(long[] buffer, long base) {
            final int count = Long.bitCount(base);
            for (int i = 0; i < count; i++) {
                base ^= (buffer[i] = Long.lowestOneBit(base));
            }
            return count;
        }
    }

    private static class Mikera extends Harness {
        protected Mikera(long[] randomData, long[] buffer) {
            super(randomData, buffer);
        }

        @Override
        protected int decompose(long[] buffer, long base) {
            int count = 0;
            while (base != 0) {
                long mask = base & (-base);
                base &= ~mask;
                buffer[count++] = mask;
            }
            return count;
        }
    }

    private static class Rsp1 extends Harness {
        protected Rsp1(long[] randomData, long[] buffer) {
            super(randomData, buffer);
        }

        @Override
        protected int decompose(long[] buffer, long base) {
            int count = 0;

            if (0 != (base & 1)) {
                buffer[count++] = 1;
            }

            base >>>= 1;

            for (long bit = 1; 0 < bit && bit <= base; ) {
                if (0 < (base & bit)) {
                    buffer[count++] = (bit <<= 1);
                }
            }
            return count;
        }
    }

    private static class Rsp2 extends Harness {
        protected Rsp2(long[] randomData, long[] buffer) {
            super(randomData, buffer);
        }

        @Override
        protected int decompose(long[] buffer, long base) {
            int count = 0;

            for (long bit = 1; 0 != base; bit <<= 1, base >>>= 1) {
                if (0 < (base & 1)) {
                    buffer[count++] = bit;
                }
            }
            return count;
        }
    }

    private static class MarkPeters extends Harness {
        protected MarkPeters(long[] randomData, long[] buffer) {
            super(randomData, buffer);
        }

        @Override
        protected int decompose(long[] buffer, long base) {
            int count = Long.bitCount(base);
            for (int i = 0; i < count; i++) {
                base -= (buffer[i] = Long.lowestOneBit(base));
            }
            return count;

        }
    }

    private static class MarkPetersServer64Bit extends Harness {
        protected MarkPetersServer64Bit(long[] randomData, long[] buffer) {
            super(randomData, buffer);
        }

        @Override
        protected int decompose(long[] buffer, long base) {
            int count = 0;
            while ( 0 != (base ^= (buffer[count++] = Long.lowestOneBit(base))));
            return count;
        }
    }
}

Sample output

Which method is best depends on the situation.

Non-server, 32-bit JVM

Control took 41.175272 ms - last base: 0
Carl took 691.966919 ms - last base: 5852835112840111303
MarkPeters took 642.230253 ms - last base: 5852835112840111303
MarkPetersServer64Bit took 742.594626 ms - last base: 5852835112840111303
Rsp2 took 3886.203787 ms - last base: 5852835112840111303
Mikera took 1044.451494 ms - last base: 5852835112840111303

Winner: MarkPeters

Server, 32-bit JVM

Control took 2.354383 ms - last base: 0
Carl took 508.687401 ms - last base: 338317437500027646
MarkPeters took 521.831297 ms - last base: 338317437500027646
MarkPetersServer64Bit took 727.052206 ms - last base: 338317437500027646
Rsp2 took 3811.75662 ms - last base: 338317437500027646
Mikera took 665.252599 ms - last base: 338317437500027646

Winner: Carl

Server (implicit or explicit), 64-bit JVM

Control took 0.007186 ms - last base: 0
Carl took 543.701859 ms - last base: -8898262206218882664
MarkPeters took 439.706079 ms - last base: -8898262206218882664
MarkPetersServer64Bit took 391.831055 ms - last base: -8898262206218882664
Rsp2 took 1861.40449 ms - last base: -8898262206218882664
Mikera took 435.964319 ms - last base: -8898262206218882664

Winner: MarkPetersServer64Bit

Note: there is no comparable 64-bit JVM that can run in non-server mode to my knowledge. See here:

Are both -client and -server VM modes available in 64-bit Java?

Currently only the Java HotSpot Server VM supports 64-bit operation, and the -server option is implicit with the use of -d64. This is subject to change in a future release.

Mark Peters
Thanks for posting this; the one question I have - since `buffer` is never read (only overwritten), might the compiler optimize away any side-effects to it? or is that accounted for by comparing to the control method?
Carl
@Carl: I made a few changes including reading the buffer after the last small iteration and randomizing the data between big iterations. I don't think anything can be feasibily optimized away now. And yes, the control is a good indication nothing is.
Mark Peters
+1  A: 

How about this version?

public static int decompose(long[] buffer, long base) {
    int count = 0;

    if (0 != (base & 1)) {
        buffer[count++] = 1;
    }

    base >>>= 1;

    for (long bit = 1; 0 < bit && bit <= base; ) {
        if (0 < (base & bit)) {
            buffer[count++] = (bit <<= 1);
        }
    }
    return count;
}

The sign bit of base complicates the end condition in this case, so we prevent this from happening by "adjusting" the base value at the start.

Alternatively instead of checking the bitmask versus base, we can shift base itself:

public static int decompose(long[] buffer, long base) {
    int count = 0;

    for (long bit = 1; 0 != base; bit <<= 1, base >>>= 1) {
        if (0 < (base & 1)) {
            buffer[count++] = bit;
        }
    }
    return count;
}
rsp
I'd add this to the test but it doesn't seem to ever terminate...unless it's really really bad. I think you must have an infinite loop or something.
Mark Peters
@Mark Peters, yeah you're correct; the 1st version does not terminate for the case that the highest bit is set and buffer is not exhausted... The 2nd version does not have that problem. I'll edit the 1st to get it to terminate.
rsp
Actually, I use the sign bit, so that comparison doesn't quite work.
Carl
@Carl: I added it to the tests but as you can see in the output you get the wrong results every so often.
Mark Peters
@Carl, the sign bit should not be a problem in the lastest versions.
rsp
@rsp the first still seems to have accuracy issues; the second returns the right results.
Mark Peters
@Mark Peters, I think I fixed that accuracy problem (serves me right for trying this without an eclipse or JDK present :-)) I will have another look tomorow when I will have eclipse around.
rsp
+2  A: 

OK after trying long and hard to beat your original code, I found one that consistently beats yours (in my environment) by shamelessly stealing your code and tweaking it.

    protected int decompose(long[] buffer, long base) {
        int count = Long.bitCount(base);
        for (int i = 0; i < count; i++) {
            base -= (buffer[i] = Long.lowestOneBit(base));
        }
        return count;
    }

Only difference: uses subtraction instead of xor. I don't think there are any edge cases; I certainly didn't find any. No idea why those aren't optimized to the same thing (assuming there aren't edge cases), but it does make mine consistently faster (again, on my machine). I've added this version to the test answer.

Edit I guess the reason it can't be optimized is because the compiler doesn't know that the operand will always be less than base (since we're going from lower to higher bits).

Edit #2 With the -server flag set, Carl's is consistently marginally faster than mine again.

Edit #3 Yep sure enough, when run on a 64 bit JVM this is significantly faster:

    protected int decompose(long[] buffer, long base) {
        int count = 0;
        while ( 0 != (base -= (buffer[count++] = Long.lowestOneBit(base))));
        return count;
    }

(Or the equivalent xor version), because it can perform 64-bit comparisons using native processor operations.

Mark Peters
I managed to up mine a bit by compacting the for-loop as well (consistently a few 10s of ms faster) - no clue why, though: `for (int i=0; i<count; base ^= (buffer[i++] = Long.lowestOneBit(base)));` - I imagine the same would apply for `-=`
Carl
also, no shame in stealing - that's what I was hoping to do with my question!
Carl
@Carl: that doesn't seem to make a difference in server mode. Also I was trying to find out why pre-computing the count using `Long.bitCount()` was better than something like `while ( 0 != (base -= (buffer[count++] = Long.lowestOneBitSet(base))) );`. Finally figured out it was because the latter requires long comparison whereas yours requires int comparison. Didn't think that would compensate for the needless calculation.
Mark Peters
@Carl: yep I confirmed my comment above and found out that on a 64-bit JVM foregoing the `Long.bitCount()` call saves time because long, 64-bit comparisons can be performed natively on 64-bit architectures.
Mark Peters
Carl
@Carl: Depends on if you're running in server mode. Mikera's is pretty competitive (neither consistently won) with 64-bit non-server, but using xor plus the long comparisons (while loop) seems best for 64-bit server. See the benchmark answer. **Edit:** Oops I should clarify when I moved my edit to the benchmark I used XOR instead of subtraction.
Mark Peters
Carl
@Carl: Just tried it, it doesn't seem to make the slightest bit of difference, even without `-server`. But then I just remembered that only the server JVM supports 64-bit operation, so server is always enabled implicitly; the arg changes nothing.
Mark Peters
A: 

It seems to me something very simple like this would be (and is) quite fast...

public static int decompose(long[] buffer, long base) {
    int count = 0;
    long mask = 1;

    for (; base != 0; base >>>= 1) {
        if ((base & 1) == 1) {
            buffer[count++] = mask;
        }
        mask <<= 1;
    }
    return count;
}
Mark E
using the test harness in one of the other answers, this method takes about 4 times as long as the other ones.
Carl
@Carl, took out the Math.abs call, should be a fair bit better now.
Mark E
still takes about 4x. try (1) modifying just one variable other than buffer/counter instead of two, and (2) looking at fewer positions (right now, you only avoid the leading zeros).
Carl