Last fall I spent a lot of time optimizing the internals of FS2. The `Free`

monad is used extensively in FS2 for modeling various algebras – most notably, an algebra supporting `Stream`

and an algebra supporting `Pull`

. Almost any operation a user performs on a `Stream`

or `Pull`

results in one or more bounces on the trampoline inside `Free`

. As such, any optimization in `Free`

pays off big time. In this article, we’ll look at one of these optimizations – *map fusion*.

Operator fusion is the act of condensing successive invocations of an operation in to a single invocation. Hence, map fusion is the act of condensing successive `map`

invocations in to a single `map`

. In the context of `Free`

, map fusion means that for all `fa: Free[F, A]`

, `f: A => B`

, `g: B => C`

, `fa.map(f).map(g)`

should result in the *same internal structure* as `fa.map(f andThen g)`

. This should result in a significant performance improvement for free programs with lots of successive map operations, especially if `map`

is implemented in terms of `flatMap`

and `pure`

, as we can condense multiple bounces on the trampoline in to a single bounce.

## Coyoneda

There’s already a data structure that provides map fusion for an arbitrary type constructor – `Coyoneda`

. Here’s a simplified implementation of `Coyoneda`

based off the one defined in Cats:

```
sealed abstract class Coyoneda[F[_], A] { self =>
type Pivot
val pivot: F[Pivot]
val transform: Pivot => A
def map[B](f: A => B): Coyoneda[F, B] = new Coyoneda[F, B] {
type Pivot = self.Pivot
val pivot = self.pivot
val transform = f compose self.transform
}
def run(implicit F: Functor[F]): F[A] =
F.map(pivot)(transform)
}
object Coyoneda {
def apply[F[_], A, B](fa: F[A], f: A => B): Coyoneda[F, B] { type Pivot = A } =
new Coyoneda[F, B] {
type Pivot = A
val pivot: F[Pivot] = fa
val transform: Pivot => B = f
}
def lift[F[_], A](fa: F[A]): Coyondea[F, A] =
apply[F, A, A](fa, identity)
}
```

`Coyoneda[F, A]`

stores a value of `F[X]`

for some `X`

along with a function from `X => A`

. It forms a `Functor[Coyoneda[F, ?]]`

where mapping results in extending the transformation function by the function passed to `map`

.

In some sense, ** Coyoneda encodes map fusion directly** – it hides away an

`F[X]`

for some `X`

and then incrementally builds a transformation function by composing the functions passed to `map`

. When composition is complete, the `Coyoneda`

is “unwrapped” by mapping the composed transformation over the pivot (`F[X]`

) via a single invocation of `map`

. We won’t go in to detail on the theoretical underpinnings of `Coyoneda`

, but if you’re interested, a great starting point is Free Monads and the Yoneda Lemma by Rúnar Bjarnason.This is pretty powerful – we can get map fusion for any `Functor`

by lifting a value to `Coyoneda`

, passing the coyoneda to functions that operate on a `Functor`

, and then unwrapping via `run`

!

## Embedding Coyoneda in Free

We want to go a bit further with `Free`

and embed map fusion inside the `map`

method. To do this, we can embed a `Coyoneda`

inside `Free`

. This is a very mechanical embedding – the general idea is to add another constructor to `Free`

representing `Map`

(in addition to the standard constructors like `Pure`

, `Suspend`

, and `Bind`

). The `Map`

constructor wraps a `Coyoneda`

. We can then implement the `map`

operation on `Free`

such that it either extends the coyoneda transformation (if `map`

was called on a `Map`

instance) or constructs a new coyoneda (if `map`

was called on any other constructor). So something like this:

```
trait Free[F, A] {
def map[B](f: A => B): Free[F, B] = this match {
case Map(coyoneda) => Map(coyoneda.map(f))
case other => Map(Coyondea[Free[F, ?], A, B](other, f))
}
}
object Free {
case class Map(coyoneda: Coyoneda[F, A]) extends Free[F, A]
}
```

What do we lose by doing this? With our current definition of `Coyoneda`

, we lose stack safety. We normally get stack safety from the monad in which we interpret the `Free`

, with `Free`

re-associating left-nested `flatMap`

s. If `map`

is implemented in terms of `flatMap`

and `pure`

, this works fine. But in the `Coyoneda`

backed version, we’ve lost stack safety due to the dependence on Scala’s function composition in `Coyoneda#map`

.

Consider:

```
val z: Coyoneda[Function0, Int] = Coyoneda(() => 0, identity[Int])
(0 to 10).foldLeft(z)((acc, i) => acc.map(_ + i)).run.apply
// 55
(0 to 10000).foldLeft(z)((acc, i) => acc.map(_ + i)).run.apply
// 50005000
(0 to 100000).foldLeft(z)((acc, i) => acc.map(_ + i)).run.apply
// java.lang.StackOverflowError
// at scala.Function1.$anonfun$compose$1(Function1.scala:44)
// at scala.Function1.$anonfun$compose$1(Function1.scala:44)
// at scala.Function1.$anonfun$compose$1(Function1.scala:44)
```

This is caused by the implementation of `compose`

in `scala.Function1`

:

```
trait Function1[-T1, +R] {
def apply(v1: T1): R
def compose[A](g: A => T1): A => R = { x => apply(g(x)) }
def andThen[A](g: R => A): T1 => A = { x => g(apply(x)) }
}
```

Each call to `compose`

(or `andThen`

) returns a new function that calls `apply`

on the previous function, resulting in a stack frame for each composition.

## Stack Safe Function Composition

Our map fusion optimization hinges on function composition. How can we compose functions in a stack safe way? We can use a *type aligned sequence* – a sequence of functions such that the output type of each function in the sequence is compatible with the input type of the next function in the sequence. We do not need fancy typelevel functions or dependent types to implement this! Rather, we can use a simple cons list.

```
final class ListF1[-A, +B] private (private val fs: List[(Any => Any)]) extends (A => B) {
override def apply(a: A): B =
fs.foldLeft(a: Any)((x, f) => f(x)).asInstanceOf[B]
override def compose[C](g: C => A): ListF1[C, B] =
new ListF1(g.asInstanceOf[Any => Any] :: fs)
}
object ListF1 {
def apply[A, B](f: A => B): ListF1[A, B] = f match {
case f: ListF1[A, B] => f
case _ => new ListF1(f.asInstanceOf[Any => Any] :: Nil)
}
}
```

We represent the type level sequence with a `List[(Any => Any)]`

and we enforce the “type alignment” property via the type signature of `compose`

. The list always has at least one element thanks to `apply`

. Each call to `compose`

results in cons-ing a function on to the type aligned sequence. Finally, we implement `ListF1#apply`

by applying each function in turn, using the output of the previous function as the input to the next.

```
// Normal function composition is not stack safe
(0 until 1000000).foldLeft(identity[Int] _)((acc, i) => acc.compose(_ + 1))(0)
// java.lang.StackOverflowError
// at scala.runtime.java8.JFunction1$mcII$sp.apply(JFunction1$mcII$sp.java:12)
// at scala.Function1.$anonfun$compose$1(Function1.scala:44)
// at scala.Function1.$anonfun$compose$1(Function1.scala:44)
// ListF1 composition is stack safe
(0 until 1000000).foldLeft(ListF1(identity[Int]))((acc, i) => acc.compose(_ + 1))(0)
// 1000000
```

This definition is sufficient to restore stack safety to `Coyoneda`

and `Free`

! We just need to wrap each `Function1`

in a `ListF1`

inside the definition of `Coyoneda#map`

.

## Generalized Stack Safe Function Composition

While `ListF1`

provides a stack safe `compose`

, it inherits the default `andThen`

from `Function1`

. This is very dangerous if we hope to provide a general purpose stack safe function wrapper. Extending `ListF1`

to support both stack safe `compose`

and `andThen`

(and arbitrary interleavings) requires us to store the type aligned sequence in a data structure that has constant time cons and snoc operations. The only standard library collection that supports this is `Vector`

(correction: @nickstanch pointed out that `scala.collection.immutable.Queue`

has O(1) cons and snoc and O(n) traversal – benchmarks below have been updated to include a `Queue`

based implementation).

```
final class VectorF1[-A, +B] private (private val fs: Vector[(Any => Any)]) extends (A => B) {
override def apply(a: A): B =
fs.foldLeft(a: Any)((x, f) => f(x)).asInstanceOf[B]
override def compose[C](g: C => A): VectorF1[C, B] =
new VectorF1(g.asInstanceOf[Any => Any] +: fs)
override def andThen[C](g: B => C): VectorF1[A, C] =
new VectorF1(fs :+ g.asInstanceOf[Any => Any])
}
object VectorF1 {
def apply[A, B](f: A => B): VectorF1[A, B] = f match {
case f: VectorF1[A, B] => f
case _ => new VectorF1(Vector(f.asInstanceOf[Any => Any]))
}
}
```

`VectorF1`

seems perfect. It has stack safe `compose`

and `andThen`

. Unfortunately, it is terribly slow. Most composed functions are going to be made up of a small number of compositions – 2, 3, or 4 functions. It’s uncommon to compose 1,000,000 functions. `Vector`

has effectively constant time cons and snoc, but there are large constant factors. For example, a single element vector will allocate a 32-element array internally. These constant factors end up dominating the performance results, so this solution won’t work.

Fortunately, FS2 has a data structure that performs better – `fs2.util.Catenable`

. It has O(1) cons, snoc, and concat, amortized O(1) uncons, O(n) traversal, and negligible constant factors for small collections.

```
import fs2.util.Catenable
final class CatenableF1[-A, +B] private (private val fs: Catenable[(Any => Any)]) extends (A => B) {
override def apply(a: A): B =
fs.foldLeft(a: Any)((x, f) => f(x)).asInstanceOf[B]
override def compose[C](g: C => A): CatenableF1[C, B] =
new CatenableF1(g.asInstanceOf[Any => Any] +: fs)
override def andThen[C](g: B => C): CatenableF1[A, C] =
new CatenableF1(fs :+ g.asInstanceOf[Any => Any])
}
object CatenableF1 {
def apply[A, B](f: A => B): CatenableF1[A, B] = f match {
case f: CatenableF1[A, B] => f
case _ => new CatenableF1(Catenable.single(f.asInstanceOf[Any => Any]))
}
}
```

### Stack Safe Function Benchmarks

Let’s confirm our performance assumptions with a JMH benchmark:

```
@State(Scope.Thread)
class FunctionBenchmark {
val f = identity[Int] _
@Benchmark def unitary_Function1 = f(0)
@Benchmark def unitary_ListF1 = ListF1(f)(0)
@Benchmark def unitary_VectorF1 = VectorF1(f)(0)
@Benchmark def unitary_CatenableF1 = CatenableF1(f)(0)
@Benchmark def unitary_QueueF1 = QueueF1(f)(0)
@Benchmark def compose2_Function1 = (f compose f)(0)
@Benchmark def compose2_ListF1 = (ListF1(f) compose f)(0)
@Benchmark def compose2_VectorF1 = (VectorF1(f) compose f)(0)
@Benchmark def compose2_CatenableF1 = (CatenableF1(f) compose f)(0)
@Benchmark def compose2_QueueF1 = (QueueF1(f) compose f)(0)
@Benchmark def andThen2_Function1 = (f andThen f)(0)
@Benchmark def andThen2_ListF1 = (ListF1(f) andThen f)(0)
@Benchmark def andThen2_VectorF1 = (VectorF1(f) andThen f)(0)
@Benchmark def andThen2_CatenableF1 = (CatenableF1(f) andThen f)(0)
@Benchmark def andThen2_QueueF1 = (QueueF1(f) andThen f)(0)
@Benchmark def compose1k_Function1 = (0 until 1000).foldLeft(f)((acc, _) => acc compose f)(0)
@Benchmark def compose1k_ListF1 = (0 until 1000).foldLeft(ListF1(f))((acc, _) => acc compose f)(0)
@Benchmark def compose1k_VectorF1 = (0 until 1000).foldLeft(VectorF1(f))((acc, _) => acc compose f)(0)
@Benchmark def compose1k_CatenableF1 = (0 until 1000).foldLeft(CatenableF1(f))((acc, _) => acc compose f)(0)
@Benchmark def compose1k_QueueF1 = (0 until 1000).foldLeft(QueueF1(f))((acc, _) => acc compose f)(0)
@Benchmark def andThen1k_Function1 = (0 until 1000).foldLeft(f)((acc, _) => acc andThen f)(0)
@Benchmark def andThen1k_ListF1 = (0 until 1000).foldLeft(ListF1(f): Int => Int)((acc, _) => acc andThen f)(0)
@Benchmark def andThen1k_VectorF1 = (0 until 1000).foldLeft(VectorF1(f))((acc, _) => acc andThen f)(0)
@Benchmark def andThen1k_CatenableF1 = (0 until 1000).foldLeft(CatenableF1(f))((acc, _) => acc andThen f)(0)
@Benchmark def andThen1k_QueueF1 = (0 until 1000).foldLeft(QueueF1(f))((acc, _) => acc andThen f)(0)
@Benchmark def interleaved1k_Function1 = (0 until 1000).foldLeft(f)((acc, i) => if (i % 2 == 0) acc compose f else acc andThen f)(0)
@Benchmark def interleaved1k_ListF1 = (0 until 1000).foldLeft(ListF1(f): Int => Int)((acc, i) => if (i % 2 == 0) acc compose f else acc andThen f)(0)
@Benchmark def interleaved1k_VectorF1 = (0 until 1000).foldLeft(VectorF1(f))((acc, i) => if (i % 2 == 0) acc compose f else acc andThen f)(0)
@Benchmark def interleaved1k_CatenableF1 = (0 until 1000).foldLeft(CatenableF1(f))((acc, i) => if (i % 2 == 0) acc compose f else acc andThen f)(0)
@Benchmark def interleaved1k_QueueF1 = (0 until 1000).foldLeft(QueueF1(f))((acc, i) => if (i % 2 == 0) acc compose f else acc andThen f)(0)
}
```

Running these tests via `jmh:run -i 20 -wi 10 -f1 -t2`

results in:

```
[info] Benchmark Mode Cnt Score Error Units
[info] FunctionBenchmark.andThen1k_CatenableF1 thrpt 20 45571.159 ± 2335.498 ops/s
[info] FunctionBenchmark.andThen1k_Function1 thrpt 20 126595.146 ± 9753.267 ops/s
[info] FunctionBenchmark.andThen1k_ListF1 thrpt 20 122326.955 ± 10781.655 ops/s
[info] FunctionBenchmark.andThen1k_QueueF1 thrpt 20 37697.152 ± 3168.007 ops/s
[info] FunctionBenchmark.andThen1k_VectorF1 thrpt 20 45148.023 ± 442.307 ops/s
[info] FunctionBenchmark.andThen2_CatenableF1 thrpt 20 32498316.581 ± 785428.217 ops/s
[info] FunctionBenchmark.andThen2_Function1 thrpt 20 797337943.963 ± 21834869.821 ops/s
[info] FunctionBenchmark.andThen2_ListF1 thrpt 20 155599638.602 ± 4425324.044 ops/s
[info] FunctionBenchmark.andThen2_QueueF1 thrpt 20 14540756.592 ± 123320.145 ops/s
[info] FunctionBenchmark.andThen2_VectorF1 thrpt 20 14688247.077 ± 131961.840 ops/s
[info] FunctionBenchmark.compose1k_CatenableF1 thrpt 20 69420.052 ± 1235.404 ops/s
[info] FunctionBenchmark.compose1k_Function1 thrpt 20 152949.881 ± 1692.388 ops/s
[info] FunctionBenchmark.compose1k_ListF1 thrpt 20 71636.332 ± 1137.961 ops/s
[info] FunctionBenchmark.compose1k_QueueF1 thrpt 20 51920.074 ± 428.416 ops/s
[info] FunctionBenchmark.compose1k_VectorF1 thrpt 20 40058.457 ± 4437.660 ops/s
[info] FunctionBenchmark.compose2_CatenableF1 thrpt 20 27988815.099 ± 2636173.720 ops/s
[info] FunctionBenchmark.compose2_Function1 thrpt 20 789010032.967 ± 28815277.968 ops/s
[info] FunctionBenchmark.compose2_ListF1 thrpt 20 95572290.279 ± 1972247.968 ops/s
[info] FunctionBenchmark.compose2_QueueF1 thrpt 20 20448278.219 ± 224039.432 ops/s
[info] FunctionBenchmark.compose2_VectorF1 thrpt 20 12204863.300 ± 155372.662 ops/s
[info] FunctionBenchmark.interleaved1k_CatenableF1 thrpt 20 47635.076 ± 556.602 ops/s
[info] FunctionBenchmark.interleaved1k_Function1 thrpt 20 129433.210 ± 1434.155 ops/s
[info] FunctionBenchmark.interleaved1k_ListF1 thrpt 20 134112.681 ± 2443.776 ops/s
[info] FunctionBenchmark.interleaved1k_QueueF1 thrpt 20 43697.844 ± 1693.528 ops/s
[info] FunctionBenchmark.interleaved1k_VectorF1 thrpt 20 30688.994 ± 512.420 ops/s
[info] FunctionBenchmark.unitary_CatenableF1 thrpt 20 48579687.093 ± 1281930.212 ops/s
[info] FunctionBenchmark.unitary_Function1 thrpt 20 809718632.059 ± 6557132.444 ops/s
[info] FunctionBenchmark.unitary_ListF1 thrpt 20 172189813.428 ± 3269454.794 ops/s
[info] FunctionBenchmark.unitary_QueueF1 thrpt 20 22289854.985 ± 500526.148 ops/s
[info] FunctionBenchmark.unitary_VectorF1 thrpt 20 18913195.601 ± 502848.724 ops/s
```

Based on these tests, `CatenableF1`

is about 3x faster than `VectorF1`

for small compositions and about 1.5x - 2x `VectorF1`

for large collections. `CatenableF1`

is about 3x slower than `ListF1`

and about 14x slower than `Function1`

for small compositions and about equal to `ListF1`

and 3x slower than `Function1`

for large compositions. `CatenableF1`

beats `QueueF1`

which beats `VectorF1`

.

Not bad but stack safety comes at a runtime performance cost. One area for investigation is a dynamic algorithm that switches from `Function1`

to `CatenableF1`

when the composition stack reaches a certain depth.

## Conclusion

If you check the FS2 source, you might notice that `Free`

does not do map fusion. We ended up not merging map fusion in to FS2 because we were able to get comparable performance gains with a simpler optimization that covers more use cases. We now eagerly evaluate `Bind(Pure(a), f)`

structures when stepping a `Free`

, which gives us the same performance boost as map fusion but covers more use cases and avoids the (minor) constant factor performance hit of `CatenableF1`

.