views:

1002

answers:

5

I've been comparing for fun different languages for speed in execution of the following program: for i from 1 to 1000000 sum the product i*(sqrt i)

One of my implementations (not the only one) is constructing a list [1..1000000] and then folding with a specific funtion.

The program works fine and fast in Haskell (even when using foldl and not foldl') but stack overflows in OCaml and F#.

Here is the Haskell code:

test = foldl (\ a b -> a + b * (sqrt b)) 0

create 0 = []
create n = n:(create (n-1))

main = print (test (create 1000000))

And here is the OCaml one:

let test = List.fold_left (fun a b -> a +. (float_of_int b) *. (sqrt (float_of_int b))) 0. ;;

let rec create = function
    | 0 -> []
    | n -> n::(create (n-1)) ;;

print_float (test (create 1000000));;

Why does the OCaml/F# implementation stack overflows?

+25  A: 

The Haskell code for create evaluates the list lazily, i.e. as elements are needed by foldl. The entire list is not needed all at once.

By contrast, F# and OCaml evaluate the create list strictly, i.e. the code tries to generate the entire list in one go before passing it to fold_left.

One possibility in F# is to use a seq expression in the create function: this generates the list lazily in the same way as Haskell. (I'm not sure if OCaml has an equivalent feature.)

Tim Robinson
OCaml has `lazy` and `Lazy.force`. It used to be the obvious ML-side implementation that anyone could write with a reference to a closure. Special support was integrated in the runtime in 3.0? and more support (pattern-matching) in 3.11.0. Nowadays, the GC removes the indirection some time after the suspension has been forced. http://ralyx.inria.fr/2008/Raweb/gallium/uid48.html
Pascal Cuoq
+7  A: 

In this form your create function is not tail-recursive. You can rewrite it in a tail recursive form that doesn't cause a stack overflow:

let rec create n l =
    match n with 
    | 0 -> l
    | n -> create (n-1) (n::l)

print_float (test (create 1000000 []));;
Stringer Bell
That's true, but for some reason it has a huge performance impact (at least for Haskell), execution time is multiplied by 3.
Fernand Pajot
@Fernand: In Haskell, it's better to use your original function because of lazy evaluation. In absence of lazy evaluation, it's better -- actually almost necessary -- to use a tail recursive version.
Michał Marczyk
In this version doesn't OCaml still build the entire list in memory befor computing the fold? In Haskell the list is built as it is consumed, so it's never in memory all at once.
MtnViewMark
@mtnviewmark: In OCaml you could use the LazyList module that's part of BatteriesIncluded so that it would work very similarly to the Haskell version: http://batteries.forge.ocamlcore.org/doc.preview%3Abatteries-beta1/html/api/Lazy_list.html
aneccodeal
+7  A: 

Another way to fix the code so that it works in F# is to use sequence expressions that are also generated lazily and in a way that doesn't cause StackOverflow (this won't work in OCaml, because sequence expressions are F# specific). Here is the code:

let test nums = Seq.fold (fun a b -> 
  a + (float b) * (sqrt (float b))) 0.0 nums

let rec create count = seq {
  match count with
  | 0 -> do ()
  | n -> yield n
         yield! create (n-1) }

printf "%f" (test (create 1000000));; 

I'm not sure about the performance, however the compiler definitely does an optimization in the create function, so that iterating over the generated sequence should be reasonably fast. However, the aim of sequence expressions is mainly readability (since F# is not pure, it gives you a lot of space for manual optimizations if you really need them as opposed to Haskell which needs to optimize the pure code you write). Anyway, if you have some tests, you can give it a try!

Tomas Petricek
And to make the OCaml version work lazily like the Haskell version you could use the Lazy_list module that's part of BatteriesIncluded. http://batteries.forge.ocamlcore.org/doc.preview%3Abatteries-beta1/html/api/Lazy_list.html
aneccodeal
Thanks for the addition - F# also has `LazyList` module (it can be found in PowerPack), but it is used less frequently than _sequence expressions_.
Tomas Petricek
+17  A: 

Firstly, if you're doing performance comparisons for numeric stuff, lists aren't the best choice. Try a package like the vector package for fast arrays.

And note that you can do even better in Haskell, thanks to loop fusion. By writing the create function as an enumeration the compiler can combine the create step, and the fold loop, into a single loop that allocates no intermediate data structures. The ability to do general fusion like this is unique to GHC Haskell.

I'll use the vector library (stream-based loop fusion):

import qualified Data.Vector as V

test = V.foldl (\ a b -> a + b * sqrt b) 0

create n = (V.enumFromTo 1 n)

main = print (test (create 1000000))

Now, before, with your code, the compiler is unable to remove all lists, and we end up with an inner loop like:

$wlgo :: Double# -> [Double] -> Double#

$wlgo =
  \ (ww_sww :: Double#) (w_swy :: [Double]) ->
    case w_swy of _ {
      [] -> ww_sww;
      : x_aoY xs_aoZ ->
        case x_aoY of _ { D# x1_aql ->
        $wlgo
          (+##
             ww_sww (*## x1_aql (sqrtDouble# x1_aql)))
          xs_aoZ
        }
    }

$wcreate :: Double# -> [Double]

$wcreate =
  \ (ww_swp :: Double#) ->
    case ==## ww_swp 0.0 of _ {
      False ->
        :
          @ Double
          (D# ww_swp)
          ($wcreate (-## ww_swp 1.0));
      True -> [] @ Double
    }

Note how there are two loops: create generating a (lazy) list, and the fold consuming it. Thanks to laziness, the cost of that list is cheap, so it runs in a respectable:

$ time ./C
4.000004999999896e14
./C  0.06s user 0.00s system 98% cpu 0.058 total

Under fusion, however, we get instead a single loop only!

main_$s$wfoldlM_loop :: Double# -> Double# -> Double#

main_$s$wfoldlM_loop =
  \ (sc_sYc :: Double#) (sc1_sYd :: Double#) ->
    case <=## sc_sYc 1000000.5 of _ {
      False -> sc1_sYd;
      True ->
        main_$s$wfoldlM_loop
          (+## sc_sYc 1.0)
          (+##
             sc1_sYd (*## sc_sYc (sqrtDouble# sc_sYc)))

GHC reduced our create and test steps into a single loop with no lists used. Just 2 doubles in registers. And with half as many loops, it runs nearly twice as fast:

$ ghc D.hs -Odph -fvia-C -optc-O3 -optc-march=native -fexcess-precision --make
$ time ./D
4.000005000001039e14
./D  0.04s user 0.00s system 95% cpu 0.038 total

This is a nice example of the power that guarantees of purity provide -- the compiler can be very aggressive at reording your code.

Don Stewart
Weird, for me my version takes 44ms to execute, and yours 76!
Fernand Pajot
It requires a patch to the vector library for fusible Double (get it from the darcs version of vector, http:://code.haskell.org/vector )
Don Stewart
@Farnand Pajot: did you use dons' command line to compile the code? I think loop fusion happens only when optimisations are enabled.
liwp
Yes I did, I also used those optimizations for a simple tail-recursive function (no list involved) and execution time is 3 times better!
Fernand Pajot
+2  A: 

The program works fine and fast in Haskell (even when using foldl and not foldl') but stack overflows in OCaml and F#.

Your Haskell is using lazy lists but your OCaml/F# is using strict lists, so your programs are incomparable.

FWIW, a solution using on-demand sequences in F# is:

Seq.sumBy (fun i -> let i = float i in i * sqrt i) (seq {1..1000000})
Jon Harrop