tags:

views:

238

answers:

5

Hello everyone, I've designed a function to compute the mean of a list. Although it works fine, but I think it may not be the best solution due to it takes two functions rather than one. Is it possible to do this job done with only one recursive function ?

calcMeanList (x:xs) = doCalcMeanList (x:xs) 0 0

doCalcMeanList (x:xs) sum length =  doCalcMeanList xs (sum+x) (length+1)
doCalcMeanList [] sum length = sum/length
+3  A: 

While I am not sure whether or not it would be 'best' to write it in one function, it can be done as follows:

If you know the length (lets call it 'n' here) in advance its easy - you can calculate how much each value 'adds' to the average; that is going to be value/length. Since avg(x1, x2, x3) = sum(x1, x2, x3)/length = (x1 + x2 + x3)/3 = x1/3 + x2/3 + x2/3

If you don't know the length in advance, its a little trickier:

lets say we use the list {x1,x2,x3} without knowing its n=3.

first iteration would just be x1 (since we assume its only n=1) second iteration would add x2/2 and divide the existing average by 2 so now we have x1/2 + x2/2

after the third iteration we have n=3 and we would want to have x1/3 +x2/3 + x3/3 but we have x1/2 + x2/2

so we would need to multiply by (n-1) and divide by n to get x1/3 + x2/3 and to that we just add the current value (x3) divided by n to end up with x1/3 + x2/3 + x3/3

Generally:

given an average (arithmetic mean - avg) for n-1 items, if you want to add one item(newval) to the average your equation will be:

avg*(n-1)/n + newval/n. The equation can be proven mathematically using induction.

Hope this helps.

*note this solution is less efficient than simply summing the variables and dividing by the total length as you do in your example.

Assaf
+7  A: 

Your solution is good, using two functions is not worse than one. Still, you might put the tail recursive function in a where clause.

But if you want to do it in one line:

calcMeanList = uncurry (/) . foldr (\e (s,c) -> (e+s,c+1)) (0,0)
Chris
why foldr and not foldl? seems a much better fit to me.
Axman6
foldl, foldl' or foldr can be used here as you must traverse the entire list anyway (it was the one I picked)... I think if the performance matters foldl' may be used here
Chris
+4  A: 

When I saw your question, I immediately thought "you want a fold there!"

And sure enough, a similar question has been asked before on StackOverflow, and this answer has a very performant solution, which you can test in an interactive environment like GHCi:

import Data.List

let avg l = let (t,n) = foldl' (\(b,c) a -> (a+b,c+1)) (0,0) l 
            in realToFrac(t)/realToFrac(n)

avg ([1,2,3,4]::[Int])
2.5
avg ([1,2,3,4]::[Double])
2.5
Daniel Pryden
+2  A: 

For those who are curious to know what glowcoder's and Assaf's approach would look like in Haskell, here's one translation:

avg [] = 0
avg x@(t:ts) = let xlen = toRational $ length x
                   tslen = toRational $ length ts
                   prevAvg = avg ts
               in (toRational t) / xlen + prevAvg * tslen / xlen

This way ensures that each step has the "average so far" correctly calculated, but does so at the cost of a whole bunch of redundant multiplying/dividing by lengths, and very inefficient calculations of length at each step. No seasoned Haskeller would write it this way.

An only slightly better way is:

avg2 [] = 0
avg2 x = fst $ avg_ x
    where 
      avg_ [] = (toRational 0, toRational 0)
      avg_ (t:ts) = let
           (prevAvg, prevLen) = avg_ ts
           curLen = prevLen + 1
           curAvg = (toRational t) / curLen + prevAvg * prevLen / curLen
        in (curAvg, curLen)

This avoids repeated length calculation. But it requires a helper function, which is precisely what the original poster is trying to avoid. And it still requires a whole bunch of canceling out of length terms.

To avoid the cancelling out of lengths, we can just build up the sum and length and divide at the end:

avg3 [] = 0
avg3 x = (toRational total) / (toRational len)
    where 
      (total, len) = avg_ x
      avg_ [] = (0, 0)
      avg_ (t:ts) = let 
          (prevSum, prevLen) = avg_ ts
       in (prevSum + t, prevLen + 1)

And this can be much more succinctly written as a foldr:

avg4 [] = 0
avg4 x = (toRational total) / (toRational len)
    where
      (total, len) = foldr avg_ (0,0) x
      avg_ t (prevSum, prevLen) = (prevSum + t, prevLen + 1)

which can be further simplified as per the posts above.

Fold really is the way to go here.

Owen S.
+3  A: 

About the best you can do is this version:

import qualified Data.Vector.Unboxed as U

data Pair = Pair {-# UNPACK #-}!Int {-# UNPACK #-}!Double

mean :: U.Vector Double -> Double
mean xs = s / fromIntegral n
  where
    Pair n s       = U.foldl' k (Pair 0 0) xs
    k (Pair n s) x = Pair (n+1) (s+x)

main = print (mean $ U.enumFromN 1 (10^7))

It fuses to an optimal loop in Core (the best Haskell you could write):

main_$s$wfoldlM'_loop :: Int#
                              -> Double#
                              -> Double#
                              -> Int#
                              -> (# Int#, Double# #)    
main_$s$wfoldlM'_loop =
  \ (sc_s1nH :: Int#)
    (sc1_s1nI :: Double#)
    (sc2_s1nJ :: Double#)
    (sc3_s1nK :: Int#) ->
    case ># sc_s1nH 0 of _ {
      False -> (# sc3_s1nK, sc2_s1nJ #);
      True ->
        main_$s$wfoldlM'_loop
          (-# sc_s1nH 1)
          (+## sc1_s1nI 1.0)
          (+## sc2_s1nJ sc1_s1nI)
          (+# sc3_s1nK 1)
    }

And the following assembly:

Main_mainzuzdszdwfoldlMzqzuloop_info:
.Lc1pN:
        testq %r14,%r14
        jg .Lc1pQ
        movq %rsi,%rbx
        movsd %xmm6,%xmm5
        jmp *(%rbp)
.Lc1pQ:
        leaq 1(%rsi),%rax
        movsd %xmm6,%xmm0
        addsd %xmm5,%xmm0
        movsd %xmm5,%xmm7
        addsd .Ln1pS(%rip),%xmm7
        decq %r14
        movsd %xmm7,%xmm5
        movsd %xmm0,%xmm6
        movq %rax,%rsi
        jmp Main_mainzuzdszdwfoldlMzqzuloop_info

Based on Data.Vector. For example,

$ ghc -Odph --make A.hs -fforce-recomp
[1 of 1] Compiling Main             ( A.hs, A.o )
Linking A ...
$ time ./A
5000000.5
./A  0.04s user 0.00s system 93% cpu 0.046 total

See the efficient implementations in the statistics package.

Don Stewart