Persistent data structures

When learning to program in a functional language such as Lisp, OCaml, or Haskell, one of the most difficult aspects of the paradigm shift is that data in these languages is almost entirely immutable, or read-only. In purely functional programs there is no assignment, and data structures cannot be modified. But how are we to do anything useful with a data structure we can't modify? The answer is that instead of modifying it, we create new data structures that incorporate the old structure as a part.

The simplest example of this is the singly-linked list. Say you have a pointer to the head of a linked list A. I can take that list and add three new nodes to the beginning to get a new list B. Yet, due to the structure of the list, the old pointer to the first node of the list A will still point to the same data structure as before; you wouldn't even notice I prepended the new elements. We have retained access to both the previous version and updated version of the list simultaneously. Conversely, if I take a pointer to the third node of A, I get a new list C consisting of A with the first two elements removed (called a tail of the list), but existing pointers to A still see the whole list. A data structure with this special property, the ability to retain access to the old version during updates, is called a persistent data structure (not to be confused with a disk-based data structure, which might also be inferred from the word "persistent"). Why might this be useful? There are several reasons:

  • We can treat all lists as immutable, or read-only. As a result, a function can be sure that no one else is going to modify its list. It can safely pass the list to untrusted functions, and annoying problems caused by aliasing and race conditions cannot arise.
  • Because the data is immutable, the compiler can make assumptions that allow it to produce faster code, and the runtime and garbage collector can also make simplifying assumptions.
  • It allows us to efficiently "undo" a sequence of operations - just retain a pointer to the old version you want to return to later, and when you're done with the new version, throw it away (assuming you have garbage collection). This is particularly useful for rolling back a failed operation.
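
To make the sharing concrete, here is a minimal Haskell sketch using ordinary built-in lists (the names a, b, and c are mine, purely for illustration): prepending builds a new list that reuses every node of the old one, and taking a tail returns a suffix without copying anything, while the original list is left untouched.

    -- Structural sharing with ordinary Haskell lists.
    a :: [Int]
    a = [1, 2, 3, 4, 5]

    b :: [Int]
    b = 10 : 20 : 30 : a   -- prepend three nodes; b shares all of a's cells

    c :: [Int]
    c = drop 2 a           -- a tail of a; no nodes are copied

    main :: IO ()
    main = do
      print b  -- [10,20,30,1,2,3,4,5]
      print c  -- [3,4,5]
      print a  -- [1,2,3,4,5], the old version is unchanged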

On the other hand, other operations on the singly-linked list, such as removing a node or inserting one in the middle, would overwrite existing pointers and destroy the old version. We can make any data structure (even arrays) persistent by simply making a complete copy for each modifying operation, but this approach is obviously very expensive in time and space. An active area of programming language research is finding new, more clever data structures that can perform more operations efficiently while remaining persistent.

Following is a simple example, taken from Chris Okasaki's excellent and readable PhD thesis, Purely Functional Data Structures (you can also purchase a book version). Say we wish to implement a simple queue. Our queue will support three main operations: add an element to the front, add an element to the back, and remove an element from the front. A singly linked list supports only two of these persistently: adding or removing an element at the front. A simple solution to this problem is to maintain two singly-linked lists, call them L and R. The queue's contents are simply the contents of L followed by the contents of the reverse of R. For example, if L={1,2} and R={4,3}, then our queue contains the values {1,2,3,4} in that order. To add to the front or back of the queue, we simply add to the front of either L or R, respectively. To remove an element from the front, we remove an element from the front of L. However, it may happen that L is empty. In that case, we take R, reverse it, and make it the new L. In other words, we create a new queue {L,R} where L is the reverse of the old R and R is the empty list.
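
Here is a hedged sketch of that two-list queue in Haskell; the names (Queue, pushFront, pushBack, popFront) are my own, and this is the simple eager version of the idea rather than the lazy variant Okasaki analyzes in the thesis.

    -- A two-list queue: the contents are l followed by (reverse r).
    data Queue a = Queue [a] [a]

    empty :: Queue a
    empty = Queue [] []

    pushFront :: a -> Queue a -> Queue a
    pushFront x (Queue l r) = Queue (x : l) r

    pushBack :: a -> Queue a -> Queue a
    pushBack x (Queue l r) = Queue l (x : r)

    -- Remove from the front; if l is empty, reverse r to build a new l.
    popFront :: Queue a -> Maybe (a, Queue a)
    popFront (Queue []      []) = Nothing
    popFront (Queue []      r ) = popFront (Queue (reverse r) [])
    popFront (Queue (x : l) r ) = Just (x, Queue l r)

Starting from empty, pushing 2 then 1 onto the front and 3 then 4 onto the back yields L={1,2} and R={4,3}, the configuration in the example above, and popFront then returns 1, 2, 3, 4 in order.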

But isn't this operation, reversing a potentially long list, expensive? Not in the amortized sense. Each element added to R is moved to L at most once. Therefore, a sequence of n insertions and n removals still takes only O(n) time, which is an amortized cost of O(1) per operation. We say that we "charge" each of the original n insertions a cost of O(1) to "pay for" the O(n) reversal cost.

Binary search trees and even self-balancing trees like red-black trees can also be made fully persistent: when we add a new node N, we create a new version of N's parent, call it P, with N as a child. We must also create a new version of P's parent with P as a child, and so on up to the root. For a balanced tree, O(log n) space is required for such an insertion operation - more than the O(1) required for an in-place update, but much less than the O(n) required for a complete copy-and-update. These are very useful for building map data structures that need the advantages of persistence.
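
This copy-the-path idea (often called path copying) can be sketched for a plain, unbalanced binary search tree; a red-black tree would also rebalance along the way, but the sharing pattern is the same. The code below is my own illustration, not taken from the article.

    data Tree a = Leaf | Node (Tree a) a (Tree a)

    -- Insert by copying only the nodes on the path from the root down to
    -- the insertion point; every subtree off that path is shared with the
    -- old version of the tree.
    insert :: Ord a => a -> Tree a -> Tree a
    insert x Leaf = Node Leaf x Leaf
    insert x t@(Node left v right)
      | x < v     = Node (insert x left) v right  -- old right subtree shared
      | x > v     = Node left v (insert x right)  -- old left subtree shared
      | otherwise = t                             -- already present; reuse t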

One key observation to building more clever persistent data structures is that, really, we can modify the old version, as long as it behaves the same as it did before. For example, if we have a binary search tree, we can rebalance the tree without changing its contents; it only becomes more efficient to access. Objects holding existing pointers to the tree won't notice that it's been rebalanced as long as they access it only through its abstract interface and can't touch its internal representation. This can be exploited to maintain an efficient representation as more and more versions are created. On the other hand, this also sacrifices the advantage of avoiding race conditions; we must ensure the data structure is not accessed in an intermediate state where it would not behave the same.

Persistent data structures aren't strictly limited to functional languages. For example, the MOO and ColdMUD virtual environment languages use immutable data structures for built-in string, list, and map types, but have mutable objects. Nevertheless, implementations of persistent data structures are, today, largely limited to functional languages. But they're useful in every language - perhaps one day we will see them in standard libraries. Meanwhile, take a look at Okasaki's thesis, and when thinking about your next design, give a thought to how persistent data structures could simplify your own code.

Comments

  • Anonymous
    December 19, 2005
    Applicative data structures are cool, though they definitely have some caveats. They're much easier to deal with in a garbage-collected memory system; otherwise you either need to ref-count all your objects or assume that objects have global lifetime. I used these in my blog a while back to develop a splay tree (or any rebalancing tree) that performs well in a multi-threaded environment. Normally the splay operation locks out the entire tree, forcing synchronous access; however, if you turn it into a persistent data structure, you needn't lock the tree during the splay.
  • Anonymous
    December 20, 2005
    Thanks for the comments. Regarding the need for GC, that's a good point. When two data structures held by two different pieces of code can share structure, it becomes difficult to assign an "owner". This is one reason that persistent data structures aren't that common on disk or in filesystems - because garbage collectors for disks are generally either complex or slow.

    Regarding fast multithreaded performance of balanced binary trees, you're probably aware of lock-free data structures that excel in this area without the persistent data structure's log space overhead per operation, but that's another topic.