Functional Data Structures Notes

I read Okazaki’s mind-opening book on functional data structures a couple of years ago and recently found my notes.

The key difference of functional data structures is “persistence” – that prior “versions” are still reachable and valid when the data structure is “modified”. (Contrast ephemeral data structures, where modification destroys previous versions.) Persistent data structures are commonly used in functional programming languages, where destructive mutation is avoided – it’s a side-effect with poor composability and potentially non-local, non-thread-safe consequences.

Key to understanding their performance is “amortised complexity” – the time complexity of a structure over a series of operations, which individually may have different performance. This can be determined using either the physicist’s or banker’s analysis, which are essentially methods of accounting. The simplest example is probably the double-list-as-queue, where you eventually have to reverse the tail queue (see Haskell example below). On average enqueue-dequeue is constant, but particular operations are linear in the number of items in the back of the queue.

data Queue a = Queue [a] [a]

enqueue :: Queue a -> a -> Queue a
enqueue (Queue front back) x =Queue front (x:back)

dequeue :: Queue a -> Maybe (a, Queue a)
dequeue (Queue [] []) = Nothing
dequeue (Queue (x:xs) back) = Just (x, Queue xs back)
dequeue (Queue [] back) =
    let kcab = reverse back in
    Just (head kcab, Queue (tail kcab) [])

Worse, you could be forced to replay the linear-time dequeue, causing the time-complexity to blow out. The solution to that is to use laziness, so that if a pathological dequeue is replayed, the operation has already completed inside the data structure.

Scheduling is the periodic forcing of suspensions within a data structure, and can be used to make amortised complexity bounds hold in the worst case. In the list example, we could force one step of the reversal each dequeue, so that the reversal is complete when it is needed, and each dequeue is in constant time.

This example has more general analogues.

“Batched rebuilding” is the rebuilding of data structures (eg heaps) after batches of operations instead of at every step, and aims at having good amortised complexity.

“Global rebuilding” uses incremental execution to eliminate the amortisation from batched rebuilding, using two copies of the data – one to answer queries, and the other in the process of building rebuilt. However modifications have to be applied to both, and themselves might need to be buffered, leading to more complexity. (No analogue is shown here, as Haskell is already a lazy language.) Unlike batched rebuilding, it has worst-case bounds and is usable for persistent data structures.

Okazaki’s “lazy rebuilding” employs laziness to execute the batched rebuild after it is “paid for”, all at once. This makes it suitable for persistent data structures. Scheduling can be applied to recover the worst-case bounds.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s