views:

474

answers:

6

I'm working on a library for Python that implements some persistent data structures (mainly as a learning exercise). However, I'm beginning to learn that explaining persistent data structures to people unfamiliar with them can be difficult. Can someone help me think of an easy (or at least the least complicated) way to describe persistent data structures to them?

I've had a couple of people tell me that the documentation that I have is somewhat confusing.

(And before anyone asks, no I don't mean persistent data structures as in persisted to the file system. Google persistent data structures if you're unclear on this.)

+3  A: 

Try this:

Persistent data structures are data structures which, unlike arrays or lists, always preserve their original structure and contents.

They are effectively immutable - instead of changing such a structure, any computation will yield a new copy leaving the original structure unchanged. Whereas conventional data structures follow an imperative (change-based) philosophy, persistent ones are commonly used in functional programming where, in extreme cases, no change is allowed at all. The advantage of this paradigm is that persistent data are much safer, often easier to handle and even preventing bugs connected to unintended change and multithreading issues. Many problems can be formulated functionally in a much more concise way.

Example for a non-persistent data structure: The list (vector-list)

list = [1, 2, 3]
# List is [1, 2, 3]

list += [4]
# Added an element at the end

list[0] = 42
# We changed the first element to 42
# List is [42, 2, 3, 4]

Persistence example: Linked lists (Haskell)

list = [1, 2, 3]
# List is [1, 2, 3]

list2 = 0 : list
# List2 combines [0] and [1, 2, 3] to [0, 1, 2, 3]
# List is still [1, 2, 3]

list3 = updateAt 0 42 list
# List3 contains the elements of list2 with the first one set to 42
# List and List2 still hold their original values
Dario
Kind of a combination of "immutable" and "copy on write". For instance, these lists are very tuplish - note the behavior of a1 =(1,2,3); a2 = a1; a1 += (4,). Except tuples do not permit indexed assignment.
Paul McGuire
+4  A: 

Explaining "persistent" data structures is much, much easier if you use the word the Python community uses: "immutable".

Everyone understands immutable numbers. 2+2 can't update 2, it has to create a new value.

Everyone understand mutable lists. [2, 3].append(4) actually updates the [2,3] list. The list is mutable, the individual numbers are not. See above.

The cool step is explaining immutable tuples. (2,3).append(4) can't exist because a tuple is an immutable ("persistent") data structure. In this case, neither the tuple nor the numbers are mutable.

Immutable strings, similarly, makes perfect sense. "hello" + "there" cannot update the string "hello", but can create a new string.

Some people balk at this. Refer back to tuples and integers. It's simpler to have strings behave like numbers than to behave like lists.

Immutable fractions and decimal numbers are good examples of immutable ("persistent") data structures.

S.Lott
Perhaps I could use the word immutable more, but I think the word "persistent" is necessary. Otherwise, how would you differentiate a PersistentList from a tuple?
Jason Baker
I don't think you can make the case that they're a deep, important difference between a PersistentList and a Tuple. A PersistentList may reuse a sublist as an optimization, but that's just a small "prevent a copy" technique. It's not profound enough to dwell on.
S.Lott
I agree that the need is very much a niche need, but it's still an important difference to understand. Maybe the difference isn't earth-shattering, but there needs to be *some* distinction. I mean if I use the term "Immutable list", my first reaction would be to think of a tuple.
Jason Baker
@Jason Baker: And you'd be right. An immutable list *is* a tuple. The "reuse" subtlety of a "persistent" data structure is a subtlety that's almost incomprehensible in a modern era of multi-gigabyte memory and nano-second clocks.
S.Lott
+1  A: 

Other useful concept words to describe persistent data structures are "Snapshot" or "Version", which you may precede by "virtual" because an efficient implementation of a persistent structure would not merely make multiple copies. Persistent Data Structure libraries strive to store and process the structures in efficient fashion.

Partial persistence allows modifications to the current structure but allows previous versions of that structure to be queried. Full persistence allow any version of the structure to be modified and queried. In both cases, any modification does not impact the integrity of previous versions which can still be shown/queried as they were prior to the changes.

I think that for the presentation page of your project, you can use much of the wording from Dario and S.Lott, trying to address the following:

  • Brief description of what Persistent Data Structures are
  • possibly a link or two to basic introduction articles of the concept
  • Quick exit point about this project not being about data persistence (to disk and such)
  • Where this is used
  • What it can do for you
  • Brief code snippets and their output

Also, do remove [from the home page] details about the data type / isinstance and all that. This is valuable info, but only of use for in a user manual or some other in-depth resource.

mjv
Thanks for the comments about isinstance. I was actually debating whether to include it or not.
Jason Baker
+1  A: 

The sort of phrasing I would suggest is that the value of the given structure is persistent across function calls, and that any function calls that result in a change to the structure result in a new structure being created based on copy-on-change semantics.

For example, given a cons type structure L defined as cons(1,cons(2, cons(3, None))), prepending the value 0 to this list results in a new list L' defined as cons(0, L), which expands to cons(0, cons(1,cons(2, cons(3, None)))). However in order to append the value 4 to this list, this requires a change to a value, which results in a copy being created which contains the change: cons(0, cons(1,cons(2, cons(3, cons(4, None)))))

Obviously you can have more efficient implementations that this, but this is one logical representation of what can occur.

Personally prefer the term immutable list, which you can say differs from an ordinary python (ordered) tuple in that it defines appending, prepending, and concatenation, since that's probably clearer to the average developer :-)

Michael Sparks
I thought this sounded like a smart answer. Then I saw who wrote it and it all made sense. :-)
Jason Baker
+1  A: 

I think the key point to make is that persistent data structures are an optimization for handling immutable values.

But they are an important optimization, even with fast cpu and big memory.

For example, Java strings are immutable but they do NOT have a persistent data structure behind the scenes. That means you have to use things like StringBuilder. If they had a persistent data structure then you could just concatenate and not worry about it.

Immutability is becoming important because it helps deal with concurrency. But to make immutability efficient requires persistent data structures.

Clojure has some good explanations of persistent data structures.

Andrew McKinlay
A: 

When you alter a mutable data structure (e.g. insert into a hash table) it changes in-place and the previous version is lost. When you alter a persistent data structure (e.g. insert into a Map) you get a new data structure and the previous version remains valid.

Jon Harrop