views:

271

answers:

5

I'm trying to figure out how non-destructive manipulation of large collections is implemented in functional programming, ie. how it is possible to alter or remove single elements without having to create a completely new collection where all elements, even the unmodified ones, will be duplicated in memory. (Even if the original collection would be garbage-collected, I'd expect the memory footprint and general performance of such a collection to be awful.)

This is how far I've got until now:

Using F#, I came up with a function insert that splits a list into two pieces and introduces a new element in-between, seemingly without cloning all unchanged elements:

// return a list without its first n elements:
// (helper function)
let rec skip list n =
    if n = 0 then
        list
    else
        match list with
        | []    -> []
        | x::xs -> skip xs (n-1)

// return only the first n elements of a list:
// (helper function)
let rec take list n =
    if n = 0 then
        []
    else
        match list with
        | []    -> []
        | x::xs -> x::(take xs (n-1))

// insert a value into a list at the specified zero-based position:
let insert list position value =
    (take list position) @ [value] @ (skip list position)

I then checked whether objects from an original list are "recycled" in new lists by using .NET's Object.ReferenceEquals:

open System

let (===) x y =
    Object.ReferenceEquals(x, y)

let x = Some(42)
let L = [Some(0); x; Some(43)]
let M = Some(1) |> insert L 1

The following three expressions all evaluate to true, indicating that the value referred to by x is re-used both in lists L and M, ie. that there is only 1 copy of this value in memory:

L.[1] === x
M.[2] === x
L.[1] === M.[2]

My question:

Do functional programming languages generally re-use values instead of cloning them to a new memory location, or was I just lucky with F#'s behaviour? Assuming the former, is this how reasonably memory-efficient editing of collections can be implemented in functional programming?

(Btw.: I know about Chris Okasaki's book Purely functional data structures, but haven't yet had the time to read it thoroughly.)

+5  A: 

While the referenced objects are the same in your code, I beleive the storage space for the references themselves and the structure of the list is duplicated by take. As a result, while the referenced objects are the same, and the tails are shared between the two lists, the "cells" for the initial portions are duplicated.

I'm not an expert in functional programming, but maybe with some kind of tree you could achieve duplication of only log(n) elements, as you would have to recreate only the path from the root to the inserted element.

Jérémie Koenig
I think you're right on the tree structure, I've seen exactly such an example discussed elsewhere. I wanted to try (linked) lists first because these were the simplest collection type I could think of.Do you mention `take` because that's where a new cons cell is created by `x::(take xs (n-1))`?
stakx
Indeed -- on the other hand `skip` returns the original tail,so it is not duplicated.
Jérémie Koenig
+24  A: 
Juliet
Ha, I can always tell a Juliet answer before I get to the bottom. It's all like, "Boom! Charts!"
Chuck
stakx
A: 

You may be interested in reduction strategies of expressions in functional programming languages. A good book on the subject is The Implementation of Functional Programming Languages, by Simon Peyton Jones, one of the creators of Haskell. Have a look especially at the following chapter Graph Reduction of Lambda Expressions since it describes the sharing of common subexpressions. Hope it helps, but I'm afraid it applies only to lazy languages.

primodemus
+4  A: 

It sounds to me like your question is primarily about immutable data, not functional languages per se. Data is indeed necessarily immutable in purely functional code (cf. referential transparency), but I'm not aware of any non-toy languages that enforce absolute purity everywhere (though Haskell comes closest, if you like that sort of thing).

Roughly speaking, referential transparency means that no practical difference exists between a variable/expression and the value it holds/evaluates to. Because a piece of immutable data will (by definition) never change, it can be trivially identified with its value and should behave indistinguishably from any other data with the same value.

Therefore, by electing to draw no semantic distinction between two pieces of data with the same value, we have no reason to ever deliberately construct a duplicate value. So, in cases of obvious equality (e.g., adding something to a list, passing it as a function argument, &c.), languages where immutability guarantees are possible will generally reuse the existing reference, as you say.

Likewise, immutable data structures possess an intrinsic referential transparency of their structure (though not their contents). Assuming all contained values are also immutable, this means that pieces of the structure can safely be reused in new structures as well. For example, the tail of a cons list can often be reused; in your code, I would expect that:

(skip 1 L) === (skip 2 M)

...would also be true.

Reuse isn't always possible, though; the initial portion of a list removed by your skip function can't really be reused, for instance. For the same reason, appending something to the end of a cons list is an expensive operation, as it must reconstruct a whole new list, similar to the problem with concatenating null-terminated strings.

In such cases, naive approaches quickly get into the realm of awful performance you were concerned about. Often, it's necessary to substantially rethink fundamental algorithms and data structures to adapt them successfully to immutable data. Techniques include breaking structures into layered or hierarchical pieces to isolate changes, inverting parts of the structure to expose cheap updates to frequently-modified parts, or even storing the original structure alongside a collection of updates and combining the updates with the original on the fly only when the data is accessed.

Since you're using F# here I'm going to assume you're at least somewhat familiar with C#; the inestimable Eric Lippert has a slew of posts on immutable data structures in C# that will probably enlighten you well beyond what I could provide. Over the course of several posts he demonstrates (reasonably efficient) immutable implementations of a stack, binary tree, and double-ended queue, among others. Delightful reading for any .NET programmer!

camccann
Thank you for the pointer to Eric Lippert's blog. It's indeed very helpful, since I program mostly in C# (but hope to improve my skills through a better understanding of functional programming).
stakx
Oh, and thanks not just for the reference, but for your post as well, of course! I found the remark about breaking structures into hierarchical / layered data to be very useful, too.
stakx
+10  A: 
Norman Ramsey