views:

282

answers:

3

I was just curious about some exact implementation details of lists in Haskell (GHC-specific answers are fine)--are they naive linked lists, or do they have any special optimizations? More specifically:

  1. Do length and (!!) (for instance) have to iterate through the list?
  2. If so, are their values cached in any way (i.e., if I call length twice, will it have to iterate both times)?
  3. Does access to the back of the list involve iterating through the whole list?
  4. Are infinite lists and list comprehensions memoized? (i.e., for fib = 1:1:zipWith (+) fib (tail fib), will each value be computed recursively, or will it rely on the previous computed value?)

Any other interesting implementation details would be much appreciated. Thanks in advance!

+3  A: 

As far as I know (I don't know how much of this is GHC-specific)

  1. length and (!!) DO have to iterate through the list.

  2. I don't think there are any special optimisations for lists, but there is a technique that applies to all datatypes.

    If you have something like

    foo xs = bar (length xs) ++ baz (length xs)
    

    then length xs will be computed twice.

    But if instead you have

    foo xs = bar len ++ baz len
      where len = length xs
    

    then it will only be computed once.

  3. Yes.

  4. Yes, once part of a named value is computed, it is retained until the name goes out of scope. (The language doesn't require this, but this is how I understand the implementations behave.)

Dave Hinton
For 2., I meant if I have `doubleLength xs = length xs + length xs` (contrived, I know), will it compute length both times?
eman
@eman: see edit. I think it will only compute it once. I'm sure someone more knowledgeable will be along soon to correct me if I'm wrong.
Dave Hinton
GHC does not do common subexpression elimination by default. This is because it can be catastrophic in some cases, for example: sum [1..10^6] / fromIntegral (length [1..10^6]), if [1..10^6] were shared here then this computation would take 8 MB and take a long time because GC load. Here it is much better to recompute the list than to share it. But you are correct that if you name it -- eg. let len = length xs in bar len ++ baz len -- then it will be shared. This isn't in the standard, just GHC and every other reasonable compiler. :-)
luqui
@luqui: so in that case, it would compute `length xs` both times unless you have a named expression?
eman
@eman, in your example, yes. GHC might be able to tell that sharing an expression of type int couldn't possibly leak, but I don't think it does.
luqui
@luqui: Thanks, I've edited my answer to reflect that.
Dave Hinton
+7  A: 

Lists have no special operational treatment in Haskell. They are defined just like:

data List a = Nil | Cons a (List a)

Just with some special notation: [a] for List a, [] for Nil and (:) for Cons. If you defined the same and redefined all the operations, you would get the exact same performance.

Thus, Haskell lists are singly-linked. Because of laziness, they are often used as iterators. sum [1..n] runs in constant space, because the unused prefixes of this list are garbage collected as the sum progresses, and the tails aren't generated until they are needed.

As for #4: all values in Haskell are memoized, with the exception that functions do not keep a memo table for their arguments. So when you define fib like you did, the results will be cached and the nth fibonacci number will be accessed in O(n) time. However, if you defined it in this apparently equivalent way:

-- Simulate infinite lists as functions from Integer
cons x xs n | n == 0    = x
            | otherwise = xs (n-1)
tailF xs n = xs (n+1)

fib = 1 `cons` (1 `cons` (\n -> fib n + tailF fib n))

(Take a moment to note the similarity to your definition)

Then the results are not shared and the nth fibonacci number will be accessed in O(fib n) (which is exponential) time. You can convince functions to be shared with a memoization library like data-memocombinators.

luqui
Thanks for the detailed answer!
eman
+3  A: 

If so, are their values cached in any way (i.e., if I call length twice, will it have to iterate both times)?

GHC does not perform full Common Subexpression Elimination. For example:

{-# NOINLINE aaaaaaaaa #-}
aaaaaaaaa :: [a] -> Int
aaaaaaaaa x = length x + length x

{-# NOINLINE bbbbbbbbb #-}
bbbbbbbbb :: [a] -> Int
bbbbbbbbb x = l + l where l = length x

main = bbbbbbbbb [1..2000000] `seq` aaaaaaaaa [1..2000000] `seq` return ()

Gives on -ddump-simpl:

Main.aaaaaaaaa [NEVER Nothing] :: forall a_adp.
                                  [a_adp] -> GHC.Types.Int
GblId
[Arity 1
 NoCafRefs
 Str: DmdType Sm]
Main.aaaaaaaaa =
  \ (@ a_ahc) (x_adq :: [a_ahc]) ->
    case GHC.List.$wlen @ a_ahc x_adq 0 of ww_anf { __DEFAULT ->
    case GHC.List.$wlen @ a_ahc x_adq 0 of ww1_Xnw { __DEFAULT ->
    GHC.Types.I# (GHC.Prim.+# ww_anf ww1_Xnw)
    }
    }

Main.bbbbbbbbb [NEVER Nothing] :: forall a_ado.
                                  [a_ado] -> GHC.Types.Int
GblId
[Arity 1
 NoCafRefs
 Str: DmdType Sm]
Main.bbbbbbbbb =
  \ (@ a_adE) (x_adr :: [a_adE]) ->
    case GHC.List.$wlen @ a_adE x_adr 0 of ww_anf { __DEFAULT ->
    GHC.Types.I# (GHC.Prim.+# ww_anf ww_anf)
    }

Note that aaaaaaaaa calls GHC.List.$wlen twice.

(In fact, because x needs to be retained in aaaaaaaaa, it is more than 2x slower than bbbbbbbbb.)

KennyTM