views: 155
answers: 2

Hi, I've got a bit of a problem. I wanted to use scala.concurrent.ops.replicate to parallelize my program, but I found that the algorithm actually becomes much slower. So I wrote a little test and got the same result. Here are both versions.

Serial Code: Takes about 63 seconds to finish

object SerTest {
  def main(args: Array[String]) {
    for (x <- 1 to 10) {
      for (i <- 1 to 4) {
        for (j <- 1 to 100000) {
          val a = BigInt(j).isProbablePrime(1000)
          if (!a && j == 100000) println(i + " is ready")
        }
      }
    }
  }
}

Concurrent Code: Takes about 161 seconds to finish

import scala.concurrent.ops.replicate

object ParTest {
  def main(args: Array[String]) {
    for (x <- 1 to 10) {
      replicate(1, 5) { i =>
        for (j <- 1 to 100000) {
          val a = BigInt(j).isProbablePrime(1000)
          if (!a && j == 100000) println(i + " is ready")
        }
      }
    }
  }
}

So where is the completely obvious and embarrassing error I made? :)

Edit: Oh, and I am running this on a quad-core CPU, so it should actually be faster :)

Edit 2: Following Kevin Wright's answer, I changed the programs slightly so that they run for longer.

+2  A: 

Looking at your sample code, I'd guess that you're jumping straight into the main method from the command line. This is the absolute worst way to go about micro-benchmarking in Java!

You should first run your test a handful of times (within the same VM invocation), at least enough that the JVM has properly warmed up and been running for a good 30 seconds before you even think about measuring anything. This ensures that you're measuring compiled (not interpreted) code that has been fully optimised.

You also need to be aware of the cost of starting up threads. For short-running loops, this will be a prohibitive overhead, and will consume more time than the loop itself!
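For example, a minimal warm-up harness might look like this (a sketch, not from the original post; the Bench name and iteration counts are my own choices):

object Bench {
  def main(args: Array[String]) {
    def workload() {
      for (j <- 1 to 100000) BigInt(j).isProbablePrime(1000)
    }
    for (warm <- 1 to 5) workload()   // warm-up runs, not measured
    val start = System.nanoTime
    for (run <- 1 to 5) workload()    // measured runs, after the JIT has kicked in
    println("avg ms: " + (System.nanoTime - start) / 5 / 1000000)
  }
}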

Update

The following definitions come from ops.scala:

val defaultRunner: FutureTaskRunner = TaskRunners.threadRunner
def spawn(p: => Unit)(implicit runner: TaskRunner = defaultRunner): Unit = {...}
def replicate(start: Int, end: Int)(p: Int => Unit) {...}

So the actual runner used is injected as an implicit, or defaults to TaskRunners.threadRunner.

You can try changing this to use a thread pool by prefixing your code with:

implicit val runner = TaskRunners.threadPoolRunner

Or I believe the following will also work:

import concurrent.TaskRunners.threadPoolRunner

See if that makes any difference.
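Applied to the question's code, the suggestion would look roughly like this (a sketch only; whether the implicit actually reaches the nested spawn calls is addressed below):

import scala.concurrent.ops.replicate
import scala.concurrent.TaskRunners

object ParTest {
  // Bring the thread-pool runner into implicit scope so that replicate
  // (in theory) uses it instead of the default thread-per-task runner.
  implicit val runner = TaskRunners.threadPoolRunner

  def main(args: Array[String]) {
    for (x <- 1 to 10) {
      replicate(1, 5) { i =>
        for (j <- 1 to 100000) {
          val a = BigInt(j).isProbablePrime(1000)
          if (!a && j == 100000) println(i + " is ready")
        }
      }
    }
  }
}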


On second thoughts...

I don't think that parameter is actually going to get passed through to the nested call to spawn; it's probably better to just duplicate the method yourself. (I currently have a query about this posted on the mailing lists.)

For your convenience, here's the method in its full, terrifying glory:

def replicate(start: Int, end: Int)(p: Int => Unit) {
  if (start == end) 
    ()
  else if (start + 1 == end)
    p(start)
  else {
    val mid = (start + end) / 2
    spawn { replicate(start, mid)(p) }
    replicate(mid, end)(p)
  }
}

(you still need to define the implicit runner...)
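Putting those pieces together, a self-contained copy might look like this (a sketch; the MyOps name is my own, and it assumes the Scala 2.8-era ops API):

import scala.concurrent.ops.spawn
import scala.concurrent.TaskRunners

object MyOps {
  // Local copy of ops.replicate; because the implicit runner is defined in
  // the same scope, the nested spawn calls pick up the thread pool instead
  // of falling back to the default thread-per-task runner.
  implicit val runner = TaskRunners.threadPoolRunner

  def replicate(start: Int, end: Int)(p: Int => Unit) {
    if (start == end)
      ()
    else if (start + 1 == end)
      p(start)
    else {
      val mid = (start + end) / 2
      spawn { replicate(start, mid)(p) }
      replicate(mid, end)(p)
    }
  }
}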

Kevin Wright
OK, I've done that, but it doesn't change much. If I run the outer loop 10 times, the serial program takes 63 seconds and the parallel one 161 seconds. Regarding the startup time: I only start 4 threads because I only have 4 cores, and every inner loop takes over a second to execute. So I think I am giving the program the best possible conditions to run much faster on multiple cores.
Plankalkül
You ran it 10 times, then measured the 11th? How consistent are the results across multiple runs?
Kevin Wright
No, I measured all 10 runs. I get where you are going with this, because all the setup and JIT time is still included, and I would agree if it were a slight difference. But 63 versus 161 seconds... there is definitely a much bigger problem here.
Plankalkül
OK, so I also timed each run separately, and the results even out after about the third execution. Then I get 5.5 seconds for the serial code and about 18 seconds for the concurrent code. What is interesting: the time for the concurrent code is pretty stable from the very first loop. So maybe the JIT doesn't optimize it?
Plankalkül
Updated my answer, see if it helps any...
Kevin Wright
Thanks, but that doesn't change anything either. In the meantime I also used a profiler to look at the problem, and while I am not very experienced with such a tool, it looks to me like the 4 threads that are created block each other most of the time.
Plankalkül
One last try? :)
Kevin Wright
No, doesn't work either. I even tried to copy spawn and set the default to threadPoolRunner. But thanks for all the effort :)
Plankalkül
"thanks" = upvote :)
Kevin Wright
With pleasure :)
Plankalkül
+3  A: 

Take a look at the source of BigInteger.isProbablePrime (BigInt delegates to the Java library). It does a serious amount of new BigInteger() allocation, since that's an immutable class.

My guess is that the memory allocation is causing too much contention to benefit from parallelization. You can probably confirm this by substituting a simple calculation (like multiplying 100 million numbers together) for your prime test. Or, rewrite the prime test using Long vars instead of BigInt.
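As a sketch of that last suggestion (isPrimeLong is a name I've made up), an allocation-free trial-division test on plain Longs might look like:

// Trial division on plain Longs, so the parallel block allocates no objects.
def isPrimeLong(n: Long): Boolean = {
  if (n < 2) return false
  var i = 2L
  while (i * i <= n) {
    if (n % i == 0) return false
    i += 1
  }
  true
}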

Also, ops.replicate spawns operations into new threads rather than utilizing some sort of thread pool. Thread creation has a certain amount of overhead, but not enough to be a problem in this case. I personally prefer to stick with the more robust java.util.concurrent libraries.
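For illustration, the same workload on a fixed java.util.concurrent pool might look like this (a sketch; the PoolTest name and the latch-based join are my own choices):

import java.util.concurrent.{Executors, CountDownLatch}

object PoolTest {
  def main(args: Array[String]) {
    val pool = Executors.newFixedThreadPool(4)  // one thread per core
    for (x <- 1 to 10) {
      val latch = new CountDownLatch(4)
      for (i <- 1 to 4) {
        pool.execute(new Runnable {
          def run() {
            for (j <- 1 to 100000) {
              val a = BigInt(j).isProbablePrime(1000)
              if (!a && j == 100000) println(i + " is ready")
            }
            latch.countDown()
          }
        })
      }
      latch.await()  // wait for all four tasks before the next outer round
    }
    pool.shutdown()
  }
}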

Jon Hoffman
Thanks for the tip. I looked into it, and I don't think memory allocation is the problem. Even if all I do instead of isProbablePrime is allocate a 52 MB Int array with a constant written into it, the parallel program is still about 30% faster than the serial one. That is far from being 2.5 times slower.
Plankalkül
Are you doing the allocation inside the parallel block?
Jon Hoffman
I see a linear speedup up to the number of cores on my machine if I do: val a = isPrime(j), where: def isPrime(n: Int): Boolean = { for (i <- 2 to n/2) { if (n % i == 0) return false }; true }
Jon Hoffman
Yes, I am doing the allocation inside the block. I can even see the memory usage jumping up and down, and the parallel program definitely uses more memory than the serial one (which would be expected).
Plankalkül
My point was that you should NOT do allocations inside the parallel block if you want to see linear speedups.
Jon Hoffman
Yes, and my point is: even if I allocate heavily, the performance is way, way better than what I see with isProbablePrime, which means the allocation is not the problem; something else is. I am not looking for a fast way to calculate primes; I want to understand why the example with isProbablePrime is so slow.
Plankalkül
Well I never...
Kevin Wright