views:

1054

answers:

9

In one of my first attempts to create functional code, I ran into a performance issue.

I started with a common task - multiply the elements of two arrays and sum up the results:

var first:Array[Float] ...
var second:Array[Float] ...    
var sum=0f; 
for(ix<-0 until first.length) sum += first(ix) * second(ix);

Here is how I reformed the work:

sum = first.zip(second).map{ case (a,b) => a*b }.reduceLeft(_+_)

When I benchmarked the two approaches, the second method takes 40 times as long to complete!

Why does the second method take so much longer? How can I reform the work to be both speed efficient and use functional programming style?

+5  A: 
dbyrne
This version seems to allocate a list with the elements from `a1`, and an array with the pointwise products. I fail to see where the significant difference with the original version is.
Pascal Cuoq
Agree with your comment. I've rewritten my code.
dbyrne
+1 for the second version.
Pascal Cuoq
I didn't think Scala could do tail-call optimizations, since it runs on the JVM?
aioobe
It can: http://fupeg.blogspot.com/2009/04/tail-recursion-in-scala.html
dbyrne
In scala 2.8 there is even an annotation to assert that your code is tail-recursive (@tailrec)
dbyrne
@dbyme Could you tell how much time did the original imperative version take in your machine as well?
Daniel
+12  A: 

This is a microbenchmark, and it depends on how the compiler optimizes you code. You have 3 loops composed here,

zip . map . fold

Now, I'm fairly sure the Scala compiler cannot fuse those three loops into a single loop, and the underlying data type is strict, so each (.) corresponds to an intermediate array being created. The imperative/mutable solution would reuse the buffer each time, avoiding copies.

Now, an understanding of what composing those three functions means is key to understanding performance in a functional programming language -- and indeed, in Haskell, those three loops will be optimized into a single loop that reuses an underlying buffer -- but Scala cannot do that.

There are benefits to sticking to the combinator approach, however -- by distinguishing those three functions, it will be easier to parallelize the code (replace map with parMap etc). In fact, given the right array type, (such as a parallel array) a sufficiently smart compiler will be able to automatically parallelize your code, yielding more performance wins.

So, in summary:

  • naive translations may have unexpected copies and inefficiences
  • clever FP compilers remove this overhead (but Scala can't yet)
  • sticking to the high level approach pays off if you want to retarget your code, e.g. to parallelize it
Don Stewart
A: 

Your functional solution is slow because it is generating unnecessary temporary data structures. Removing these is known as deforesting and it is easily done in strict functional languages by rolling your anonymous functions into a single anonymous function and using a single aggregator. For example, your solution written in F# using zip, map and reduce:

let dot xs ys = Array.zip xs ys |> Array.map (fun (x, y) -> x * y) -> Array.reduce ( * )

may be rewritten using fold2 so as to avoid all temporary data structures:

let dot xs ys = Array.fold2 (fun t x y -> t + x * y) 0.0 xs ys

This is a lot faster and the same transformation can be done in Scala and other strict functional languages. In F#, you can also define the fold2 as inline in order to have the higher-order function inlined with its functional argument whereupon you recover the optimal performance of the imperative loop.

Jon Harrop
You can always hand fuse your loops - in any language: strict, lazy, pure, impure, doesn't matter.
Don Stewart
+4  A: 

The Scala collections library is fully generic, and the operations provided are chosen for maximum capability, not maximum speed. So, yes, if you use a functional paradigm with Scala without paying attention (especially if you are using primitive data types), your code will take longer to run (in most cases) than if you use an imperative/iterative paradigm without paying attention.

That said, you can easily create non-generic functional operations that perform quickly for your desired task. In the case of working with pairs of floats, we might do the following:

class FastFloatOps(a: Array[Float]) {
  def fastMapOnto(f: Float => Float) = {
    var i = 0
    while (i < a.length) { a(i) = f(a(i)); i += 1 }
    this
  }
  def fastMapWith(b: Array[Float])(f: (Float,Float) => Float) = {
    val len = a.length min b.length
    val c = new Array[Float](len)
    var i = 0
    while (i < len) { c(i) = f(a(i),b(i)); i += 1 }
    c
  }
  def fastReduce(f: (Float,Float) => Float) = {
    if (a.length==0) Float.NaN
    else {
      var r = a(0)
      var i = 1
      while (i < a.length) { r = f(r,a(i)); i += 1 }
      r
    }
  }
}
implicit def farray2fastfarray(a: Array[Float]) = new FastFloatOps(a)

and then these operations will be much faster. (Faster still if you use Double and 2.8.RC1, because then the functions (Double,Double)=>Double will be specialized, not generic; if you're using something earlier, you can create your own abstract class F { def f(a: Float) : Float } and then call with new F { def f(a: Float) = a*a } instead of (a: Float) => a*a.)

Anyway, the point is that it's not the functional style that makes functional coding in Scala slow, it's that the library is designed with maximum power/flexibility in mind, not maximum speed. This is sensible, since each person's speed requirements are typically subtly different, so it's hard to cover everyone supremely well. But if it's something you're doing more than just a little, you can write your own stuff where the performance penalty for a functional style is extremely small.

Rex Kerr
+11  A: 
Norman Ramsey
It's not just fusion though -- though that's the biggest win here. There are other optimizations missing that are going to contribute (case-of-case, for example), and that's all before you get to the runtime, where allocations will kill you.
Don Stewart
How well does F# perform? Does CLR has more optimizations for FP?
Wei Hu
@Wei Hu - I found this link which suggests that the answer is a qualified "yes": http://stackoverflow.com/questions/142985/is-a-program-f-any-more-efficient-execution-wise-than-c. Although, interestingly, the GHC FAQ treats the CLR and the JVM as sharing similar disadvantages as far as Haskell is concerned: http://www.haskell.org/haskellwiki/GHC:FAQ#Why_isn.27t_GHC_available_for_.NET_or_on_the_JVM.3F
rtperson
F# is relatively well performing (even in comparison with Scala) in FP since e.g. CLR-generics *don't* need boxing and casts and closures are handled efficiently.
Dario
+24  A: 

The main reasons why these two examples are so different in speed are:

  • the faster one doesn't use any generics, so it doesn't face boxing/unboxing.
  • the faster one doesn't create temporary collections and, thus, avoids extra memory copies.

Let's consider the slower one by parts. First:

first.zip(second)

That creates a new array, an array of Tuple2. It will copy all elements from both arrays into Tuple2 objects, and then copy a reference to each of these objects into a third array. Now, notice that Tuple2 is parameterized, so it can't store Float directly. Instead, new instances of java.lang.Float are created for each number, the numbers are stored in them, and then a reference for each of them is stored into the Tuple2.

map{ case (a,b) => a*b }

Now a fourth array is created. To compute the values of these elements, it needs to read the reference to the tuple from the third array, read the reference to the java.lang.Float stored in them, read the numbers, multiply, create a new java.lang.Float to store the result, and then pass this reference back, which will be de-referenced again to be stored in the array (arrays are not type-erased).

We are not finished, though. Here's the next part:

reduceLeft(_+_)

That one is relatively harmless, except that it still do boxing/unboxing and java.lang.Float creation at iteration, since reduceLeft receives a Function2, which is parameterized.

Scala 2.8 introduces a feature called specialization which will get rid of a lot of these boxing/unboxing. But let's consider alternative faster versions. We could, for instance, do map and reduceLeft in a single step:

sum = first.zip(second).foldLeft(0f) { case (a, (b, c)) => a + b * c }

We could use view (Scala 2.8) or projection (Scala 2.7) to avoid creating intermediary collections altogether:

sum = first.view.zip(second).map{ case (a,b) => a*b }.reduceLeft(_+_)

This last one doesn't save much, actually, so I think the non-strictness if being "lost" pretty fast (ie, one of these methods is strict even in a view). There's also an alternative way of zipping that is non-strict (ie, avoids some intermediary results) by default:

sum = (first,second).zipped.map{ case (a,b) => a*b }.reduceLeft(_+_)

This gives much better result that the former. Better than the foldLeft one, though not by much. Unfortunately, we can't combined zipped with foldLeft because the former doesn't support the latter.

The last one is the fastest I could get. Faster than that, only with specialization. Now, Function2 happens to be specialized, but for Int, Long and Double. The other primitives were left out, as specialization increases code size rather dramatically for each primitive. On my tests, though Double is actually taking longer. That might be a result of it being twice the size, or it might be something I'm doing wrong.

So, in the end, the problem is a combination of factors, including producing intermediary copies of elements, and the way Java (JVM) handles primitives and generics. A similar code in Haskell using supercompilation would be equal to anything short of assembler. On the JVM, you have to be aware of the trade-offs and be prepared to optimize critical code.

Daniel
Is the first version with the `for` comprehension not using generics as well?
Jon Harrop
Great answer. Just as a note, I did this first in haskell. I can see that there is some ways to go until scala delivers the efficiency of haskell.
Fred Haslam
@Jon It is indeed. Using a `while` loop would make it even faster in the absence of specialization. But the other version uses it to a greater extent, making the problem much worse.
Daniel
+2  A: 

To answer the question in the title: Simple functional constructs may be slower than imperative on the JVM.

But, if we consider only simple constructs, then we might as well throw out all modern languages and stick with C or assembler. If you look a the programming language shootout, C always wins.

So why choose a modern language? Because it lets you express a cleaner design. Cleaner design leads to performance gains in the overall operation of the application. Even if some low-level methods can be slower. One of my favorite examples is the performance of BuildR vs. Maven. BuildR is written in Ruby, an interpreted, slow, language. Maven is written in Java. A build in BuildR is twice as fast as Maven. This is due mostly to the design of BuildR which is lightweight compared with that of Maven.

IttayD
igouy
+14  A: 

I did some variations of this with Scala 2.8. The loop version is as you write but the functional version is slightly different:

(xs, ys).zipped map (_ * _) reduceLeft(_ + _)

I ran with Double instead of Float, because currently specialization only kicks in for Double. I then tested with arrays and vectors as the carrier type. Furthermore, I tested Boxed variants which work on java.lang.Double's instead of primitive Doubles to measure the effect of primitive type boxing and unboxing. Here is what I got (running Java 1.6_10 server VM, Scala 2.8 RC1, 5 runs per test).

loopArray               461             437             436             437             435
reduceArray             6573            6544            6718            6828            6554
loopVector              5877            5773            5775            5791            5657
reduceVector            5064            4880            4844            4828            4926

loopArrayBoxed          2627            2551            2569            2537            2546
reduceArrayBoxed        4809            4434            4496            4434            4365
loopVectorBoxed         7577            7450            7456            7463            7432
reduceVectorBoxed       5116            4903            5006            4957            5122

The first thing to notice is that by far the biggest difference is between primitive array loops and primitive array functional reduce. It's about a factor of 15 instead of the 40 you have seen, which reflects improvements in Scala 2.8 over 2.7. Still, primitive array loops are the fastest of all tests whereas primitive array reduces are the slowest. The reason is that primitive Java arrays and generic operations are just not a very good fit. Accessing elements of primitive Java arrays from generic functions requires a lot of boxing/unboxing and sometimes even requires reflection. Future versions of Scala will specialize the Array class and then we should see some improvement. But right now that's what it is.

If you go from arrays to vectors, you notice several things. First, the reduce version is now faster than the imperative loop! This is because vector reduce can make use of efficient bulk operations. Second, vector reduce is faster than array reduce, which illustrates the inherent overhead that arrays of primitive types pose for generic higher-order functions.

If you eliminate the overhead of boxing/unboxing by working only with boxed java.lang.Double values, the picture changes. Now reduce over arrays is a bit less than 2 times slower than looping, instead of the 15 times difference before. That more closely approximates the inherent overhead of the three loops with intermediate data structures instead of the fused loop of the imperative version. Looping over vectors is now by far the slowest solution, whereas reducing over vectors is a little bit slower than reducing over arrays.

So the overall answer is: it depends. If you have tight loops over arrays of primitive values, nothing beats an imperative loop. And there's no problem writing the loops because they are neither longer nor less comprehensible than the functional versions. In all other situations, the FP solution looks competitive.

Martin Odersky
Should build.sizeHint be used in Tuple2#Zipped#map and Traversable#map to improve the performance?
Eastsun
I just tried and it helps quite a bit. The time for reduceArray goes down from ~6400 ms to ~5200ms. The time for reduceVector goes down from ~5900ms to ~4800ms. And there might be just enough time to roll this into 2.8 RC2. Thanks for suggesting it!
Martin Odersky
A: 

Here is dbyrnes solution with arrays (assuming Arrays are to be used) and just iterating over the index:

def multiplyAndSum (l1: Array[Int], l2: Array[Int]) : Int = 
{
    def productSum (idx: Int, sum: Int) : Int = 
        if (idx < l1.length)
            productSum (idx + 1, sum + (l1(idx) * l2(idx))) else 
                sum
    if (l2.length == l1.length) 
        productSum (0, 0) else 
    error ("lengths don't fit " + l1.length + " != " + l2.length) 
}


val first = (1 to 500).map (_ * 1.1) toArray                                                
val second = (11 to 510).map (_ * 1.2) toArray     
def loopi (n: Int) = (1 to n).foreach (dummy => multiplyAndSum (first, second))
println (timed (loopi (100*1000)))

That needs about 1/40 of the time of the list-approach. I don't have 2.8 installed, so you have to test @tailrec yourself. :)

user unknown