I am new to functional programming and am now learning Haskell. As an exercise I decided to implement the explicit Euler method for the 1D linear diffusion equation. While the code below works correctly, I am not happy with its performance. In particular, I am concerned about its memory consumption. I believe it is related to lazy evaluation, but cannot figure out how to reduce the memory usage.

The idea of the algorithm is really simple; to make it clear, in imperative terms: it takes an "array", and to every inner point it adds a value calculated as a combination of the values at that point and at its neighbors. Boundary points are special cases.

So, this is my Euler1D.hs module:

module Euler1D
( stepEuler
, makeu0
) where

-- impose zero flux condition
zeroflux :: (Floating a) => a -> [a] -> [a]
zeroflux mu (boundary:inner:xs) = [boundary+mu*2*(inner-boundary)]

-- one step of integration
stepEuler :: (Floating a) => a -> [a] -> [a]
stepEuler mu u@(x:xs) = (applyBC . (diffused mu)) u
    where
          diffused mu (left:x:[]) = []    -- ignore outer points
          diffused mu (left:x:right:xs) = -- integrate inner points
                   (x+mu*(left+right-2*x)) : diffused mu (x:right:xs)
          applyBC inner = (lbc u') ++ inner ++ (rbc u') -- boundary conditions
               where u' = [head u] ++ inner ++ [last u]
                     lbc = zeroflux mu             -- left boundary
                     rbc = (zeroflux mu) . reverse -- right boundary

-- initial condition
makeu0 :: Int -> [Double]
makeu0 n = [ ((^2) . sin . (pi*) . xi) x | x <- [0..n]]
    where xi x = fromIntegral x / fromIntegral n

And a simple Main.hs:

module Main where

import System ( getArgs )
import Euler1D

main = do
      args <- getArgs
      let n = read $ head args :: Int
      let u0 = makeu0 n
      let un = stepEuler 0.5 u0
      putStrLn $ show $ sum un

For comparison, I also wrote a pure C implementation.

Now, if I run the Haskell implementation for a sufficiently large array size n, I get:

$ time ./eulerhs 200000
100000.00000000112

real    0m3.552s
user    0m3.304s
sys     0m0.128s

For comparison, the C version is faster by almost two orders of magnitude:

$ time ./eulerc 200000
100000

real    0m0.088s
user    0m0.048s
sys     0m0.008s

EDIT: This comparison is not really fair, because the Haskell version is compiled with profiling flags and the C version is not. If I compile both programs with -O2 and without profiling flags, I can increase n. In that case time ./eulerhs 1000000 takes 0m2.236s, while time ./eulerc 1000000 takes only 0m0.293s. So the problem remains even with all optimizations and without profiling; it is only less pronounced.

I would also like to note that the memory allocation of the Haskell program seems to grow linearly with n. This is probably OK.

But the worst part is the memory requirements. My Haskell version requires over 100 MB (my estimate of the bare minimum in C is 4 MB). I think this may be the source of the problem. According to the profiling report, the program spends 85% of its time in GC, and

        total time  =        0.36 secs   (18 ticks @ 20 ms)
        total alloc = 116,835,180 bytes  (excludes profiling overheads)

COST CENTRE                    MODULE               %time %alloc

makeu0                         Euler1D               61.1   34.9
stepEuler                      Euler1D               33.3   59.6
CAF:sum                        Main                   5.6    5.5

I was surprised to see that makeu0 is so expensive. I decided that this is due to its lazy evaluation (its thunks remain in memory until the end of stepEuler).

I tried this change in Main.hs:

   let un = u0 `seq` stepEuler 0.5 u0

but did not notice any difference. I have no idea how to reduce memory usage in stepEuler. So, my questions are:

  1. Is there a way in Haskell to build lists / do list comprehensions strictly? In this case there is no benefit to keeping them lazy.
  2. How can I reduce overall memory usage in this case? I suppose I have to make something strict, but I fail to see what. In other words, if I have to put some seqs and bangs, where and why?
  3. And finally, the most important, what is the best strategy to identify such costly constructs?

I did read a chapter on profiling and optimization in Real World Haskell, but it remains unclear how exactly I can decide what should be strict and what not.

Please forgive the long post.

EDIT2: As suggested by A. Rex in the comments, I tried running both programs under valgrind. This is what I observed. For the Haskell program (n=200000) it found:

malloc/free: 33 allocs, 30 frees, 84,109 bytes allocated. ... checked 55,712,980 bytes.

And for the C program (after a small fix):

malloc/free: 2 allocs, 2 frees, 3,200,000 bytes allocated.

So it appears that while Haskell allocates much smaller memory blocks, it does so often, and due to delays in garbage collection they accumulate and remain in memory. So I have another question:

  • Is it possible to avoid many small allocations in Haskell? Basically, can I declare that I need to process the whole data structure rather than only fragments of it on demand?
+2  A: 

Does forcing "un-laziness" using $! help, as per this answer?
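
A minimal sketch of what that change might look like in Main.hs (keep in mind that $!, like seq, only evaluates its argument to weak head normal form, so for a list it forces just the outermost constructor, not the elements):

   let un = stepEuler 0.5 $! u0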

John Mulder
Thank you! I tried to evaluate strictly `u`, an argument of `stepEuler`, and all the arguments in `applyBC` (see http://pastebin.com/f56a2079d), but the effect is the opposite: the program allocates almost 50% more (according to valgrind) and runs almost 50% longer.
jetxee
+1  A: 

Per Harleqin's request: Have you tried setting optimization flags? For example, with ghc, you can add "-O2" just like you would with gcc. (Although I'm not sure what optimization levels exist in ghc; the man page doesn't exactly say ...)

In my past experience, setting this flag has made a tremendous difference. As far as I can tell, runhugs and unoptimized ghc use the most basic, obvious implementation of Haskell; unfortunately, this sometimes isn't very efficient.

But that's just a guess. As I said in my comment, I hope that someone answers your question well. I often have problems analyzing and fixing Haskell's memory usage.

A. Rex
Yes, I did compile with optimization flags: `ghc -O2 -c -prof -auto-all -caf-all -fforce-recomp Euler1D.hs ; ghc -O2 -o eulerhs Main.hs Euler1D.o -prof -auto-all -caf-all -fforce-recomp`
jetxee
A: 

One thing that jumped out at me is that the Haskell output is a float, while the C output seems to be an integer. I have not yet come to grips with the Haskell code, but is there perhaps some place where you use floating-point arithmetic in Haskell while C uses integers?

Svante
No, the C output is not an integer; printf just represents the result nicely when printing. The C arithmetic is entirely in double.
jetxee
+1  A: 

Also use the switch -fvia-C.
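
For example (a sketch; adjust to your own build command):

$ ghc --make -O2 -fvia-C -o eulerhs Main.hs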

Hynek -Pichi- Vychodil
+7  A: 
  1. Lists are not the best data structure for this type of code (with lots of (++) and (last)). You lose a lot of time constructing and deconstructing lists. I'd use Data.Sequence or arrays, as in the C version (a rough array-based sketch follows after this list).

  2. There is no chance for the thunks of makeu0 to be garbage-collected, since you need to retain all of them (well, all of the results of "diffused", to be exact) all the way to the end of the computation in order to do "reverse" in applyBC. Which is a very expensive thing, considering that you only need two items from the tail of the list for your "zeroflux".
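
As a rough sketch of the array-based approach from point 1, using Data.Array.Unboxed (an illustration only, with made-up names; the boundaries here use the pre-step neighbor values, so it is not numerically identical to your list version):

import Data.Array.Unboxed

-- one Euler step over an unboxed array (assumes at least two points)
stepEulerA :: Double -> UArray Int Double -> UArray Int Double
stepEulerA mu u = listArray (lo, hi) (map step [lo .. hi])
    where
          (lo, hi) = bounds u
          step i
            | i == lo   = u!lo + mu*2*(u!(lo+1) - u!lo)        -- left zero-flux boundary
            | i == hi   = u!hi + mu*2*(u!(hi-1) - u!hi)        -- right zero-flux boundary
            | otherwise = u!i + mu*(u!(i-1) + u!(i+1) - 2*u!i) -- inner points

-- initial condition as an unboxed array
makeu0A :: Int -> UArray Int Double
makeu0A n = listArray (0, n) [ ((^2) . sin . (pi*) . xi) x | x <- [0..n] ]
    where xi x = fromIntegral x / fromIntegral n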

Here is a fast hack of your code that tries to achieve better list fusion and does less list (de)construction:

module Euler1D
( stepEuler
) where

-- impose zero flux condition
zeroflux mu (boundary:inner:xs) = boundary+mu*2*(inner-boundary)

-- one step of integration
stepEuler mu n = (applyBC . (diffused mu)) $ makeu0 n
    where
          diffused mu (left:x:[]) = []    -- ignore outer points
          diffused mu (left:x:right:xs) = -- integrate inner points
                   let y = (x+mu*(left+right-2*x))
                       in y `seq` y : diffused mu (x:right:xs)
          applyBC inner = lbc + sum inner + rbc -- boundary conditions
               where
                     lbc = zeroflux mu ((f 0 n):inner)             -- left boundary
                     rbc = zeroflux mu ((f n n):(take 2 $ reverse inner)) -- right boundary

-- initial condition
makeu0 n = [ f x n | x <- [0..n]]

f x n = ((^2) . sin . (pi*) . xi) x
    where xi x = fromIntegral x / fromIntegral n

For 200,000 points it completes in 0.8 seconds, vs. 3.8 seconds for the initial version.

ADEpt
Thank you very much! This is exactly the kind of answer I was looking for. I didn't know about Data.Sequence and arrays, but I suspected that it is possible to avoid lots of list operations. P.S. y `seq` y:… in diffused is a nice pattern… also the takes in applyBC… Thank you!
jetxee
However, the summation in stepEuler changes the original semantics of the function. I restored the original semantics (http://pastebin.com/f5ca77a7f) and it still runs reasonably fast (0.5 seconds for 2e5 points). I'll consider using Data.Sequence now.
jetxee
+1  A: 

More generally, you can find out where your memory is going using GHC's heap profiling tools. In my experience, they won't necessarily tell you why your data is being leaked, but can at least narrow down the potential causes.
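
For reference, a typical invocation (assuming the program is built with profiling support, as in the compilation line you posted) looks something like:

$ ./eulerhs 200000 +RTS -hc -RTS
$ hp2ps -c eulerhs.hp

The first command writes a heap profile broken down by cost centre to eulerhs.hp, and hp2ps renders it as a graph in eulerhs.ps; -hy (by type) and -hd (by closure description) give other useful breakdowns.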

You may also find this excellent blog post by Don Stewart illuminating; it is about understanding strictness, how it interacts with garbage collection, and how to diagnose and fix problems.

daf
Thank you for the links, daf!
jetxee
The post by Don Stewart is really, really helpful. Thank you very much!
jetxee
+1  A: 

On my 32-bit x86 system, your program uses only about 40 MB of memory.

Are you perhaps confusing the "total alloc = 116,835,180 bytes" line from your profiling output with how much memory is actually used by the program at any one time? The total alloc is how much memory was allocated over the entire program run; much of this is freed by the garbage collector as you go along. You can expect that number to get very large in a Haskell program; I have programs that allocate many terabytes of memory over the course of their entire run, though they actually have a maximum virtual memory size of a hundred megabytes or so.

I wouldn't worry too much about large total allocations over the course of a program run; that's the nature of a pure language, and GHC's runtime has a very good garbage collector to help compensate for this.
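
If you want to see the peak memory use rather than the total allocation, asking the RTS for its garbage-collection statistics is one way to check, e.g.:

$ ./eulerhs 200000 +RTS -sstderr

The "maximum residency" figure in that output corresponds to how much live data the program actually holds at once.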

Curt Sampson
Yes, exactly! I understood "total alloc" as the maximum memory allocation. Now it is clearer to me. Thank you!
jetxee